
**Summary of the previous lecture**<br>
NumPy's ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.

**Limitations of NumPy**
 - It cannot attach labels to data.
 - It lacks flexibility when working with missing data.
 - It lacks flexibility for operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.).
 

# <font color = 'dodgerblue'> **Pandas** </font>

Pandas is an open-source library built on top of the NumPy library.

**Why Pandas**
1. It provides an efficient implementation of multidimensional arrays (DataFrames) with attached row and column labels.
2. These DataFrames can hold heterogeneous data and data with missing values.
3. Many of Excel's features, like making pivot tables, calculating columns based on other columns, drawing graphs, etc., can be done programmatically.
4. You can also group rows by the value in a column or join tables together, just like in SQL. 
5. Pandas also does a great job with time series.

## <font color = 'dodgerblue'> **Importing a package** </font>

1. Importing pandas is similar to importing the NumPy package.


2. **Syntax**:



```
      import package_name as alias_name
```


In [None]:
# Importing pandas package with alias name as pd


<details>
<summary>Click to expand code</summary>

```python
# Importing pandas package with alias name as pd
import pandas as pd
import numpy as np
```
</details>


In [None]:
# Checking the pandas version


<details>
<summary>Click to expand code</summary>

```python
# Checking the pandas version
pd.__version__
```
</details>



<details>
<summary>Click to expand code</summary>

```python
np.__version__
```
</details>


# <font color = 'dodgerblue'> **Pandas Objects** </font>
1. In NumPy, the elements of arrays or matrices are identified by their integer indices.
2. Pandas objects are enhanced versions of NumPy arrays in which the rows and columns are labeled.
3. The data can be accessed by using these labels for columns/rows.
4. These pandas objects are broadly classified as follows:
   * Pandas series object
   * Pandas dataframe object
   



## <font color = 'dodgerblue'> **Creating Pandas series object** </font>

1. A **Pandas Series** is a one-dimensional array of indexed data.
2. A series wraps a sequence of values together with a sequence of indices for those values.
3. It can be created from a list, dictionary, tuple, etc.
4. **Syntax** :
   

```
      pandas.Series(data, index)
```
where `data` can be a list, dictionary, tuple, etc., and `index` can be defined explicitly.

5. To access the values of the series, we can use the following code:
 
  **Syntax** :
```
    d = pandas.Series(data, index)
    d.values
```
The `values` attribute gives the values of the series without the index.

6. Similarly, we can access the index of the series without the values as follows:

 **Syntax**:

```
    d.index
```
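
As a minimal sketch of the two attributes described above (the series `d` and its labels are made up purely for illustration), both are accessed without parentheses:

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration
d = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# .values is an attribute (no parentheses) that returns the data as a NumPy array
vals = d.values
print(vals)

# .index is an attribute that returns the labels as a pandas Index object
idx = d.index
print(idx)
```

Keeping values and labels as separate objects is what lets a series behave both like an array and like a dictionary.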






###  <font color = 'dodgerblue'> **Creating series from list** </font>

In [None]:
# Creating series by using list 


<details>
<summary>Click to expand code</summary>

```python
# Creating series by using list 
my_list = [12, 0.25, 45, 36, 78, 20]
data1 = pd.Series(my_list)
print(f"The new pandas series is : \n {data1}")
```
</details>


- From the above example, a series looks similar to a one-dimensional NumPy array. The key difference is that we can define the index explicitly in a series.
- Each item in a Series object has an index label that identifies it.
- By default, the label is just the item's position in the Series (starting at 0), but you can set the index labels yourself:

###  <font color = 'dodgerblue'> **Specifying index** </font>


<details>
<summary>Click to expand code</summary>

```python
data1 = pd.Series([12, 0.25, 45, 36, 78, 20],
                  index = ['a', 'b', 'c', 'd', 'e', 'f'])
print(f"The new pandas series is : \n{data1}")
```
</details>


We can think of a series as a specialized dictionary. We can in fact construct a series from a dictionary.

###  <font color = 'dodgerblue'> **Creating series from dictionary** </font>

In [None]:
# First, let's create a dictionary
# Let's pass this dictionary to pd.Series


<details>
<summary>Click to expand code</summary>

```python
# First, let's create a dictionary
data2 = { 'a':'Pandas', 'b':'numpy', 'c' :'Packages', 'd':'Data processing'}
# Let's pass this dictionary to pd.Series
my_dict_series = pd.Series(data2)
print(f"The series using dictionary is :\n{my_dict_series}")
```
</details>


###  <font color = 'dodgerblue'> **Series Vs. Dictionary** </font>

However, unlike dictionaries, series support array-like operations (for example, slicing).


<details>
<summary>Click to expand code</summary>

```python
# Dictionaries do not support slicing, so this line raises an error
slice_dict = data2['a':'c']
print(f'Slice of dictionary : {slice_dict}')
```
</details>



<details>
<summary>Click to expand code</summary>

```python
# A series supports label-based slicing; the end label 'c' is included
slice_series = my_dict_series['a':'c']
print(f'Slice of series :\n{slice_series}')
```
</details>


###  <font color = 'dodgerblue'> **Creating series using tuple** </font>


<details>
<summary>Click to expand code</summary>

```python
tup = (20, 15, 23, 45)
pd.Series(tup)
```
</details>


##  <font color = 'dodgerblue'>  **Creating Pandas Dataframe object** </font>

We can think of a DataFrame as a two-dimensional array with both flexible row indices and flexible column names.

 **Syntax** :


```
    pandas.DataFrame(data, index, columns, dtype)
```
where, 

  **data** = ndarray, iterable (such as a list), dict, or DataFrame.

 **index** =
      Index to use for the resulting dataframe. If no index is provided, it defaults to the row's position (starting from 0). We can think of this as row labels.

  **columns** = The column labels. If no labels are provided, they default to the column's position (starting from 0).

  **dtype** = Optional data type to force on the data. If omitted, pandas infers it.



Pandas dataframe object can be created in various ways as follows :

<img src="https://drive.google.com/uc?export=view&id=1dsXSMVZ2PoiNo0Kr8KK8eOkeBTZQNby-" width="600"/>



####  <font color = 'dodgerblue'>  **Creating dataframe using a single series object** </font>

* A dataframe is a collection of series objects.
* A single-column dataframe can be created using a single series.


In [None]:
# Creating a dataframe using a single series object
# Creating a dataframe and labelling its column


<details>
<summary>Click to expand code</summary>

```python
# Creating a dataframe using a single series object
countries = pd.Series(['India', 'China', 'Austria', 'America', 'Australia'])
# Creating a dataframe and labelling its column
dt = pd.DataFrame(data = countries, columns = ['countries'])
print(f"The dataframe using a single series object is \n {dt}")
```
</details>


####  <font color = 'dodgerblue'>  **Creating dataframe using numpy array** </font>

* Let's take a two-dimensional array of data.
* We can create a dataframe by specifying the column and index names.
* If we do not specify the column and index names, pandas uses integer positions for both.


In [None]:
# Creating a dataframe using a NumPy array
# Let's import NumPy to create the array
# Creating a random array with NumPy
# rows = 4, columns = 5
# Let's create the dataframe from the array, with column labels and an index


<details>
<summary>Click to expand code</summary>

```python
# Creating a dataframe using a NumPy array
# Let's import NumPy to create the array
import numpy as np
# Creating a random array with NumPy
# rows = 4, columns = 5
my_arr = np.random.rand(4, 5)
# Let's create the dataframe from the array, with column labels and an index
dt1 = pd.DataFrame(data = my_arr , columns = ['var1', 'var2', 'var3', 'var4', 'var5'], index = ['a', 'b', 'c', 'd'])
print(f"The dataframe created by using arrays is : \n {dt1}")
```
</details>


In [None]:
# As we can see below, If we do not specify the columns and index names 
# then it will take the integer index for them.


<details>
<summary>Click to expand code</summary>

```python
# As we can see below, If we do not specify the columns and index names 
# then it will take the integer index for them.
dt1 = pd.DataFrame(data = my_arr)
print(f"The dataframe created by using arrays is : \n {dt1}")
```
</details>


####  <font color = 'dodgerblue'> **Creating a dataframe using list** </font>

A dataframe can be created from a single list or from a list of lists.

  **Syntax**:


```
  pandas.DataFrame(data)
```
We can pass a single list (one column) or a list of lists (one row per inner list) as `data`.




In [None]:
# Creating dataframe using lists
# We can observe that the column name is considered as 0
# Giving column label 


<details>
<summary>Click to expand code</summary>

```python
# Creating dataframe using lists

my_list = [14, 25, 36, 89, 12, 45]
dt = pd.DataFrame(my_list)

# We can observe that the column name is considered as 0
print(f"The dataframe using list without specifying column names\n {dt}")

# Giving column label 
dt1 = pd.DataFrame(data = my_list, columns =['Marks'])
print(f"\nThe new dataframe using the column name as Marks is: \n{dt1}")

```
</details>
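
The example above uses a single list. As a small sketch of the list-of-lists case (the names `rows` and `df_rows` and the data are hypothetical, chosen for illustration), each inner list becomes one row:

```python
import pandas as pd

# Hypothetical data: each inner list becomes one row of the dataframe
rows = [['alex', 14], ['ross', 25], ['ami', 36]]

# One column label per element of the inner lists
df_rows = pd.DataFrame(rows, columns=['name', 'marks'])
print(df_rows)
print(df_rows.shape)
```

The number of column labels should match the length of the inner lists.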


####  <font color = 'dodgerblue'> **Creating DataFrame from dict of ndarray/lists:** </font>
When creating a dataframe from a dictionary of ndarrays or lists, we need to make sure that:
* All arrays are of the same length.
* If an index is passed, its length equals the length of the arrays.
* If no index is passed, the default index is `range(n)`, where `n` is the length of the arrays.
* The keys of the dictionary are used as column labels.

In [None]:
# Creating dataframe from dict of ndarrays/Lists
# initializing the data of list


<details>
<summary>Click to expand code</summary>

```python
# Creating dataframe from dict of ndarrays/Lists
# initializing the data of list
my_arr = {'col1' :[10, 45, 26, 36, 78, 46],
          'col2' :['anu', 'alex', 'apex', 'ho', 'hi', 'bye']
          }
dt_frame = pd.DataFrame(data = my_arr)
print(f"the new dataframe is \n {dt_frame}")
```
</details>
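
The rules listed above can be checked with a short sketch. The data and the names `scores`, `df_scores` and `unequal_ok` are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two NumPy arrays of equal length
scores = {'col1': np.array([10, 45, 26]),
          'col2': np.array([1.5, 2.5, 3.5])}

# The index must have as many labels as each array has elements
df_scores = pd.DataFrame(scores, index=['r1', 'r2', 'r3'])
print(df_scores)

# Arrays of unequal length are rejected with a ValueError
try:
    pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 2]})
    unequal_ok = True
except ValueError as err:
    unequal_ok = False
    print(f"Unequal lengths fail: {err}")
```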


####  <font color = 'dodgerblue'> **Creating a dataframe using a list of dictionary** </font>
* Previously, we created a dataframe from a dictionary of lists.
* Now, let us consider the reverse nesting: a list whose elements are dictionaries.

**Example** :

`my_list = [{1 : 'hi', 2 : 'c'}]`

Here, `my_list` consists of a single dictionary.


In [None]:
# Creating a dataframe using a list of dictionaries
# First, let's create a list consisting of dictionaries
# Let's create a dataframe from the list
# The columns are labelled using the dictionary keys
# We can specify an index to label the rows


<details>
<summary>Click to expand code</summary>

```python
# Creating a dataframe using a list of dictionaries
# First, let's create a list consisting of dictionaries
my_list = [{1:'pandas', 2:'numpy', 3:'scipy', 4:'data science'},
          {1:'Amazon', 2:'Google', 3:'IBM', 4:'TalkValley'}]
# Let's create a dataframe from the list
# The columns are labelled using the dictionary keys
# We can specify an index to label the rows
my_dt = pd.DataFrame(data = my_list, index = ['Languages :','Companies :'])
print(f'The dataframe using a list of dictionary is :\n {my_dt}')
```
</details>


#  <font color = 'dodgerblue'> **Data indexing and selection** </font>
* In NumPy we used slicing, indexing, etc. in order to get sub-arrays.
* We accessed these sub-arrays to modify them or to perform operations on them.
* Similarly, we can access and modify values in pandas series and dataframe objects.


##  <font color = 'dodgerblue'> **Data selection in series** </font>
We can think of a ``Series`` object as a standard Python dictionary or as a one-dimensional NumPy array, and we use similar ideas to select a subset of a series.


###  <font color = 'dodgerblue'> **Series as dictionary** </font>
* Like a dictionary, a series object provides a mapping from a collection of keys to a collection of values.
* We can use a series just like a dictionary to examine the keys/indices and values.
* We can also modify the values of a series just like a dictionary.

The following examples illustrate these points.


In [None]:
# Lets create a series 


<details>
<summary>Click to expand code</summary>

```python
# Lets create a series 
my_ser = pd.Series([0.25, 10, 45, 26, 78, 36], index = ['first', 'second','third','fourth','fifth','sixth'])
print(f'The series is :\n {my_ser}')
```
</details>


In [None]:
# We can access the values just by using index values


<details>
<summary>Click to expand code</summary>

```python
# We can access the values just by using index values
my_ser['first']
```
</details>


In [None]:
# We can access the index by using the keys() method, like a dictionary


<details>
<summary>Click to expand code</summary>

```python
# We can access the index by using the keys() method, like a dictionary
print(f"The index are:\n {my_ser.keys()}")
```
</details>


In [None]:
# We can also access the values with their keys by using items() method and store them in list


<details>
<summary>Click to expand code</summary>

```python
# We can also access the values with their keys by using items() method and store them in list
print(f"\nthe values are :\n {list(my_ser.items())}")
```
</details>


In [None]:
# We can check whether an index label is present in the series as follows


<details>
<summary>Click to expand code</summary>

```python
# We can check whether an index label is present in the series as follows
print(f"\n'third' is present in series: {'third' in my_ser}")
```
</details>


###  <font color = 'dodgerblue'> **Series as one-dimensional array** </font>
A Series also lets you select items in an array-like way using the same basic methods as NumPy arrays: 

 * slicing
 *  masking 
 * fancy indexing 

We provide the examples below:


#### **Slicing**

**Syntax** :



```
    my_series = pandas.Series()
    sub_series = my_series[start: stop: step]
```




In [None]:
# Create a series


<details>
<summary>Click to expand code</summary>

```python
# Create a series
my_data = pd.Series([12, 78, 88, 92, 25, 14, 75], index = ['a', 'b', 'c', 'd', 'e', 'f', 'g'])
my_data
```
</details>


#####  <font color = 'dodgerblue'>  **Using slicing by explicit index** </font>

In [None]:
# Slicing by explicit index
# In explicit/label-based indexing (where we specify the index label rather than the position like 0, 1),
# the final index is included in the output
# We will get all the values indexed from a to d.


<details>
<summary>Click to expand code</summary>

```python
# Slicing by explicit index
# In explicit/label-based indexing (where we specify the index label rather than the position like 0, 1),
# the final index is included in the output
# We will get all the values indexed from a to d.
my_data['a' : 'd']
```
</details>


#####  <font color = 'dodgerblue'>  **Using slicing by implicit integer index** </font>

In [None]:
# In implicit/integer-based indexing the final index is excluded
# In the example below, we get the values at positions 0 to 2; the value at position 3 is not included


<details>
<summary>Click to expand code</summary>

```python
# In implicit/integer-based indexing the final index is excluded
# In the example below, we get the values at positions 0 to 2; the value at position 3 is not included
my_data[0 : 3]
```
</details>


####  <font color = 'dodgerblue'> **Masking** </font>

* Masking allows us to select data using a boolean list, array or series.
* When a mask is applied to the original data, it returns the data at the positions where the mask value is True.


In [None]:
# Let's create a series of data


<details>
<summary>Click to expand code</summary>

```python
# Let's create a series of data
my_ser = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
print(f"My series is \n {my_ser}")
```
</details>


In [None]:
# Sum of elements that meet the condition : 20 < value < 50
# Using a for loop to access the data
# Let's use the condition that n should be between 20 and 50


<details>
<summary>Click to expand code</summary>

```python
# Sum of elements that meet the condition : 20 < value < 50
# Using a for loop to access the data
total = 0
for n in my_ser:
    # Let's use the condition that n should be between 20 and 50
    if (n>20) and (n<50):
        total += n
print(f"the total sum is {total}")
```
</details>


In [None]:
# Let's do sum using list comprehension for the same data


<details>
<summary>Click to expand code</summary>

```python
# Let's do sum using list comprehension for the same data
sum([n for n in my_ser if (n>20) and (n<50)])
```
</details>


In [None]:
# But what if there is more data 
# it will take time for the for loop or list comprehension
# Masking will make it easier for data selection or operations


<details>
<summary>Click to expand code</summary>

```python
# But what if there is more data 
# it will take time for the for loop or list comprehension
# Masking will make it easier for data selection or operations

my_mask = ((my_ser>20) & (my_ser<50))
print(f"If the condition is True, then mask returns True. Otherwise it returns False \n {my_mask}")
```
</details>


In [None]:
# selected sub array


<details>
<summary>Click to expand code</summary>

```python
# selected sub array
s = my_ser[my_mask]
print(f'Selected sub array using mask is :\n {s}')
```
</details>


In [None]:
# Sum of the masked condition is 


<details>
<summary>Click to expand code</summary>

```python
# Sum of the masked condition is 
s = my_ser[my_mask].sum()
print(f"Therefore, the sum using mask is {s}")
```
</details>


In [None]:
# we can accomplish the above task in one statement


<details>
<summary>Click to expand code</summary>

```python
# we can accomplish the above task in one statement

my_ser[(my_ser> 20) & (my_ser<50)].sum()
```
</details>
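
The masks above combine two conditions with `&` (element-wise AND). As a brief sketch under the same idea, `|` (OR) and `~` (NOT) work the same way; the variable names below are illustrative. Each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# OR: values below 20 or above 70
outer = ser[(ser < 20) | (ser > 70)]
print(outer)

# NOT: everything except the values strictly between 20 and 50
not_middle = ser[~((ser > 20) & (ser < 50))]
print(not_middle)
```

Python's `and`, `or` and `not` keywords cannot be used here, since they expect a single truth value rather than a series of booleans.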


####  <font color = 'dodgerblue'> **Fancy indexing** </font>

* Fancy indexing allows us to pass an array of indices in place of a single index.


  **Syntax** :


```
    data = pandas.Series()
    indices = [3, 2, 5]
    data[indices]
```
where `indices` is a list of the required index labels and `data` is the pandas series from which the values are extracted.


In [None]:
# Fancy indexing
# Let's prepare the list of indices
# Let's pass this list to the series


<details>
<summary>Click to expand code</summary>

```python
# Fancy indexing

# Let's prepare the list of indices
ind = ['a', 'e' , 'd']

# Let's pass this list to the series
my_data[ind]
```
</details>


###  <font color = 'dodgerblue'> **indexers: loc, iloc** </font>

If a series has an explicit integer index, the slicing and indexing conventions can cause confusion.
Let us see an example below:



<details>
<summary>Click to expand code</summary>

```python
my_data = pd.Series(['python', 'numpy', 'pandas'], index=[1, 3, 5])
my_data
```
</details>


In [None]:
# let us see what happens in indexing


<details>
<summary>Click to expand code</summary>

```python
# let us see what happens in indexing
my_data[1]
```
</details>


In [None]:
# let us see what happens in slicing


<details>
<summary>Click to expand code</summary>

```python
# let us see what happens in slicing
my_data[1:2]
```
</details>


- With an integer index, plain indexing (`my_data[1]`) uses the explicit index (the labels we passed), whereas slicing (`my_data[1:2]`) uses the implicit index (Python-style positions, where the first element is at position 0).
- To avoid this confusion, we can use the `loc` indexer for explicit (label-based) indexing and the `iloc` indexer for implicit (integer position) indexing.

In [None]:
# explicit indexing using loc - get the item where explicit index is 1


<details>
<summary>Click to expand code</summary>

```python
# explicit indexing using loc - get the item where explicit index is 1
my_data.loc[1]
```
</details>


In [None]:
# implicit indexing using iloc - get the item where the implicit index is 1 (i.e. the second element)


<details>
<summary>Click to expand code</summary>

```python
# implicit indexing using iloc - get the item where the implicit index is 1 (i.e. the second element)
my_data.iloc[1]
```
</details>
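
The same distinction applies to slices. The sketch below rebuilds the three-element series under the illustrative name `s` and compares the two indexers on the slice `1:3`:

```python
import pandas as pd

s = pd.Series(['python', 'numpy', 'pandas'], index=[1, 3, 5])

# loc slices by label and includes the end label (labels 1 and 3)
by_label = s.loc[1:3]
print(by_label)

# iloc slices by position and excludes the end position (positions 1 and 2)
by_position = s.iloc[1:3]
print(by_position)
```

The same slice text selects different elements, which is why stating the indexer explicitly makes the intent unambiguous.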



### <font color = 'dodgerblue'> **Summary** : </font>

<font color = 'red'>

1. Use masking if you want to select data based on a condition.
2. Use `loc` for explicit indexing, when you know the exact row labels.
3. Use `iloc` (integer) for implicit indexing, which is useful when we do not know the exact labels but know the positions. For example, we want the first five rows or the last five rows, or we want to iterate over the rows in a for loop.

</font>

## <font color = 'dodgerblue'> **Data selection in DataFrame** </font>
There are two main ways to think about selecting data from a DataFrame:
 
 * DataFrame as a dictionary
 * DataFrame as a two-dimensional array 

### **DataFrame as a dictionary**
If we treat a DataFrame as a dictionary of columns, we can select data using dictionary-style access as follows:

In [None]:
# Let's try to make two series by using dictionary
#Let's convert these series into dataframe


<details>
<summary>Click to expand code</summary>

```python
# Let's try to make two series by using dictionary
my_area = pd.Series({1: 12305, 2: 26453, 3: 45789, 4: 12456, 5: 20134})
countries = pd.Series({1: 'India', 2: 'Austria', 3: 'Africa', 4: 'America', 5: 'Sri lanka'})

#Let's convert these series into dataframe
my_dt = pd.DataFrame({'my_area': my_area, 'countries' : countries})
print(f"My dataframe is \n {my_dt}")

```
</details>


In [None]:
# We can access the data by using dictionary-style indexing with the column name


<details>
<summary>Click to expand code</summary>

```python
# We can access the data by using dictionary-style indexing with the column name
my_dt['my_area']
```
</details>


In [None]:
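# We can also use attribute-style access for column names that are strings
# (this works only if the name is a valid Python identifier and does not clash with a DataFrame method)


<details>
<summary>Click to expand code</summary>

```python
# We can also use attribute-style access for column names that are strings
# (this works only if the name is a valid Python identifier and does not clash with a DataFrame method)
my_dt.countries
```
</details>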


### <font color = 'dodgerblue'> **Dataframe as two dimensional array** </font>
We can select data from a dataframe, viewed as a two-dimensional array, in the following ways:
 * `values`, an attribute that returns the underlying data as an array
 * Slicing
 * `loc` (explicit indexing) is label based: we specify the rows and columns by their labels
 * `iloc` (implicit indexing) stands for integer location: we specify the rows and columns by their integer positions

The following examples illustrate each approach:

#### <font color = 'dodgerblue'> **Values** </font>

In [None]:
# Let's create the dictionary first
# Creating the dataframe


<details>
<summary>Click to expand code</summary>

```python
# Let's create the dictionary first
data = {
    'calories' :[425, 500, 482, 369, 289, 563],
    'hours' : [10, 20, 6, 4, 5, 6],
    'sleep' :[6, 6, 8, 8, 7 , 6]
}
# Creating the dataframe

dt = pd.DataFrame(data)
print(f"The dataframe is\n {dt}")
```
</details>


In [None]:
# We can use the values attribute to get the data without column names and index
# This attribute gives us the data as a NumPy array


<details>
<summary>Click to expand code</summary>

```python
# We can use the values attribute to get the data without column names and index
# This attribute gives us the data as a NumPy array
dt.values
```
</details>


In [None]:
# We can confirm that the values attribute gives us a NumPy array


<details>
<summary>Click to expand code</summary>

```python
# We can confirm that the values attribute gives us a NumPy array
type(dt.values)
```
</details>


#### <font color = 'dodgerblue'> **Slicing** </font>

In [None]:
# Passing a single column name to access a column


<details>
<summary>Click to expand code</summary>

```python
# Passing a single column name to access a column
dt['calories']
```
</details>


In [None]:
# Let us try slicing with column names


<details>
<summary>Click to expand code</summary>

```python
# Let us try slicing with column names
# This raises an error: slicing with [] selects rows, not columns
dt['calories':'sleep']
```
</details>


In [None]:
# let us try using implicit indexing


<details>
<summary>Click to expand code</summary>

```python
# let us try using implicit indexing
dt[0:2]
```
</details>


<font color = 'red'> **Slicing Summary** </font>


- <font color = 'aqua'> In pandas, slicing with `[]` refers to rows, not columns. Slicing with integers uses implicit (positional) indexing and fetches the corresponding rows. </font>

- <font color = 'aqua'> Useful shortcuts: access a single column using indexing, for example `df['column name']`, and use slicing to access rows, for example `df[0:3]`. </font>
- <font color = 'aqua'> To avoid confusion, we should use the `loc` and `iloc` indexers. Inside the brackets, a comma separates the row selection from the column selection. </font>
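
As a short sketch of the row/column separation (the dataframe `frame` below is hypothetical and reuses the column names from the earlier example), a range of columns can be selected by putting the column slice after the comma:

```python
import pandas as pd

# Hypothetical data with the same column names as the earlier example
frame = pd.DataFrame({'calories': [425, 500, 482],
                      'hours': [10, 20, 6],
                      'sleep': [6, 6, 8]})

# Label-based: all rows, columns from 'calories' through 'hours' (end included)
by_label = frame.loc[:, 'calories':'hours']
print(by_label)

# Position-based: first two rows, columns at positions 1 and 2 (end excluded)
by_position = frame.iloc[:2, 1:3]
print(by_position)
```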



####  <font color = 'dodgerblue'> **indexers: loc, iloc** </font>

In [None]:
# Let's create a dataframe


<details>
<summary>Click to expand code</summary>

```python
# Let's create a dataframe
my_data = {
    'name':['Olivia', 'Emma', 'Ava', 'Sophia', 'Isabella', 'Charlotte'],
    'age' : [12, 15, 13, 14, 17, 19],
    'class':['seventh', 'sixth', 'ninth', 'tenth', 'graduation', 'graduation']
}
dt3 = pd.DataFrame(my_data)
print(f"The dataframe is\n {dt3}\n")
```
</details>


In [None]:
# Let's use the loc indexer to select specific data (the first four rows of the column 'name')
# Remember: in explicit indexing both the start and end labels are included


<details>
<summary>Click to expand code</summary>

```python
# Let's use the loc indexer to select specific data (the first four rows of the column 'name')
# Remember: in explicit indexing both the start and end labels are included
dt3.loc[: 3,  'name']
```
</details>


In [None]:
# Now let us get the same data using iloc
# Remember: in implicit indexing the start position is included but the end position is not


<details>
<summary>Click to expand code</summary>

```python
# Now let us get the same data using iloc
# Remember: in implicit indexing the start position is included but the end position is not
dt3.iloc[: 4,  0]
```
</details>


In [None]:
# combine loc with masking
# Get name and age of students whose age is greater than 13 


<details>
<summary>Click to expand code</summary>

```python
# combine loc with masking
# Get name and age of students whose age is greater than 13 
dt3.loc[dt3.age>13, ['name', 'age']]
```
</details>


In [None]:
# combine loc with masking
# Get name and age of students whose age is greater than 13 but less than 17


<details>
<summary>Click to expand code</summary>

```python
# combine loc with masking
# Get name and age of students whose age is greater than 13 but less than 17
dt3.loc[(dt3.age>13) & (dt3.age<17), ['name', 'age']]
```
</details>


# <font color = 'dodgerblue'> **Use cases** </font>
1. Pandas has a variety of use cases. In this notebook, we will discuss the following:

   * Handling missing values
   * Manipulating data
   * Sorting data
   * Aggregation of data
   * Grouping data
   * Data summary using pivot tables
   * Merging, joining and concatenating data
  



## <font color = 'dodgerblue'> **Handling missing values** </font>

1. Real-world data is rarely homogeneous or clean.
2. The data may have missing values or null values.
3. Pandas helps us clean our data and bring it into a presentable form.
4. Missing data are represented as null, NaN (not a number) or NA values.
5. Pandas provides several methods for detecting, removing and replacing null values:
   * isnull()
   * notnull()
   * dropna()
   * fillna()  

### <font color = 'dodgerblue'> **Detecting null values** </font>
* isnull() and notnull() are useful for detecting null values.
* isnull() gives a boolean mask for null values: it returns True for null values and False otherwise.
* notnull() gives a boolean mask for non-null values: it returns True for non-null values and False otherwise.

In [None]:
# Let's create a new dataframe


<details>
<summary>Click to expand code</summary>

```python
# Let's create a new dataframe

data2 = {
    'name' : ['alex', np.nan, 'ross', np.nan, 'potter', np.nan],
    'age' :  [10, np.nan, 20, 0, np.nan, 25 ]
}

dt2 = pd.DataFrame(data2)
print(f"the dataframe is\n {dt2}")
```
</details>


#### <font color = 'dodgerblue'> **isnull()** </font>

In [None]:
# Detecting null values by using the isnull() method
# It gives a boolean mask


<details>
<summary>Click to expand code</summary>

```python
# Detecting null values by using the isnull() method
# It gives a boolean mask
null_mask = dt2.isnull()
print(f' Mask indicating null values:\n{null_mask}')
```
</details>


In [None]:
# we can count number of null values in each column by using .sum()


<details>
<summary>Click to expand code</summary>

```python
# we can count number of null values in each column by using .sum()
print(f' Number of null values:\n{dt2.isnull().sum()}')
```
</details>


In [None]:
# Get the percentage of null values in each column


<details>
<summary>Click to expand code</summary>

```python
# Get the percentage of null values in each column
print(f' \nPercentage of null values in each column:\n{100*dt2.isnull().sum()/len(dt2)}')
```
</details>


In [None]:
# we can also get number of null values in each row


<details>
<summary>Click to expand code</summary>

```python
# we can also get number of null values in each row
print(f' Number of null values in each row:\n{dt2.isnull().sum(axis =1)}')
```
</details>


#### <font color = 'dodgerblue'> **notnull()** </font>

In [None]:
# We can find the values which are not null by using the notnull() method
# This creates a mask where all the missing values or NaN values are shown as False
# And the non-missing values are shown as True


<details>
<summary>Click to expand code</summary>

```python
# We can find the values which are not null by using the notnull() method
# This creates a mask where all the missing values or NaN values are shown as False
# And the non-missing values are shown as True
dt2.notnull()
```
</details>
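
Because the result is an ordinary boolean mask, it can be used for selection just like the masks in the earlier section. A minimal sketch, using a small hypothetical dataframe `people`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one fully missing row
people = pd.DataFrame({'name': ['alex', np.nan, 'ross'],
                       'age': [10, np.nan, 20]})

# Keep only the rows where 'age' is present
has_age = people[people['age'].notnull()]
print(has_age)
```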


### <font color = 'dodgerblue'> **Dropping null values** </font>

The dropna() method drops null values from the data. By default, it removes every row that contains at least one null value.


In [None]:
# Creating a dataframe
# We can create NaN values with the NumPy constant np.nan


<details>
<summary>Click to expand code</summary>

```python
# Creating a dataframe
# We can create NaN values with the NumPy constant np.nan
data = {
    'columns': [0, 10, 20, np.nan , 0 , 12 , np.nan],
    'rows' : ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh']
}
df = pd.DataFrame(data)
print(f"The data frame is \n {df}")
```
</details>


In [None]:
# let's drop the NA/Nan values by using dropna()


<details>
<summary>Click to expand code</summary>

```python
# let's drop the NA/Nan values by using dropna()
df.dropna()
```
</details>
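
`dropna()` also accepts parameters that control which rows are removed. The sketch below uses a hypothetical dataframe `frame` to compare the default behaviour with the `how` and `subset` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns
frame = pd.DataFrame({'a': [1, np.nan, 3, np.nan],
                      'b': ['x', 'y', np.nan, np.nan]})

# Default: drop every row that contains at least one missing value
any_missing = frame.dropna()

# how='all': drop a row only if all of its values are missing
all_missing = frame.dropna(how='all')

# subset: look for missing values only in the listed columns
only_a = frame.dropna(subset=['a'])

# dropna returns a new DataFrame; the original is left unchanged
print(len(frame), len(any_missing), len(all_missing), len(only_a))
```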


###  <font color = 'dodgerblue'> **Filling the null values** </font>
fillna() is used to fill the NA values in the dataframe.

In [None]:
# Consider the dataframe df from the above example
# Replace (fill) Na values with zero


<details>
<summary>Click to expand code</summary>

```python
# Consider the dataframe df from the above example
# Replace (fill) Na values with zero

df.fillna(0)
```
</details>


In [None]:
# Using forward fill to carry the previous value forward


<details>
<summary>Click to expand code</summary>

```python
# Using forward fill to carry the previous value forward
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas releases
df.ffill()
```
</details>


In [None]:
# Using back-fill to carry the next value backward
# The NaN values are filled from the values that follow them


<details>
<summary>Click to expand code</summary>

```python
# Using back-fill to carry the next value backward
# The NaN values are filled from the values that follow them
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas releases
df.bfill()
```
</details>
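
Besides a constant or a neighbouring value, a common choice is to fill gaps with a statistic of the column. A minimal sketch, assuming a hypothetical single-column dataframe `frame` and that the column mean is a reasonable substitute:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two gaps
frame = pd.DataFrame({'value': [0, 10, 20, np.nan, 0, 12, np.nan]})

# mean() skips missing values by default
col_mean = frame['value'].mean()

# Fill the gaps in the column with that column's mean
filled = frame['value'].fillna(col_mean)
print(filled)
```

Whether the mean, the median or another value is appropriate depends on the data; this sketch only shows the mechanics.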


##  <font color = 'dodgerblue'> **Data Manipulation** </font>
Pandas offers many different methods to manipulate data. We will discuss some of them below:


###  <font color = 'dodgerblue'> **Adding new columns to the data** </font>
We can add new columns to an existing dataframe.
An example is given below:

In [None]:
# Adding new columns to the data
# Let's create a dataframe


<details>
<summary>Click to expand code</summary>

```python
# Adding new columns to the data
# Let's create a dataframe
data = {
    'name': ['alex', 'ami', 'ross', 'suzan', 'henry'],
    'class' : ['first', 'first', 'second', 'second', 'second'],
    'math_score' : [20, 20, 30, 50, 10],
    'english_score' : [30, 40, 50, 60, 20],
    'gender' :['Male', 'Female', 'Female', 'Female', 'Male']}
    
df1 = pd.DataFrame(data)
print(f"My dataframe is :\n {df1}")
```
</details>


In [None]:
# let's create a series consisting of age information of students


<details>
<summary>Click to expand code</summary>

```python
# let's create a series consisting of age information of students
age = pd.Series([4, 7, 8, 9, 5])
print(f"The series is :\n {age}")
</details>


In [None]:
# add a new column to df1 - using the series age
# the name of the new column should be age
# The modified  dataframe is 


<details>
<summary>Click to expand code</summary>

```python
# add a new column to df1 - using the series age
# the name of the new column should be age
df1['age'] = age

# The modified  dataframe is 
print(f"The modified dataframe is :\n")
df1
</details>
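
When a Series is assigned as a new column, pandas matches the values by index label, not by position. The sketch below uses made-up data (`students` and its `age` Series are assumptions for illustration) to make this visible.

```python
import pandas as pd

# Hypothetical data, for illustration only
students = pd.DataFrame({'name': ['alex', 'ami', 'ross']})

# The Series index is deliberately out of order and has no label 1
age = pd.Series([8, 4], index=[2, 0])

# values are matched by index label, not by position
students['age'] = age
print(students)
```

Here `alex` (label 0) gets 4, `ross` (label 2) gets 8, and `ami` (label 1) gets `NaN` because the Series has no value for that label.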


### <font color = 'dodgerblue'> **Modifying Data using index**

In [None]:
# change the age of ross to 20
# The modified  dataframe is 


<details>
<summary>Click to expand code</summary>

```python
# change the age of ross to 20
df1.loc[df1.name=='ross','age'] = 20
# The modified  dataframe is 
print(f"The modified dataframe is :\n {df1}")
</details>


### <font color = 'dodgerblue'> **Modifying Data using numpy ufuncs**
Pandas is built on top of NumPy, so NumPy ufuncs work directly on Pandas Series and DataFrame objects and keep their index and column labels.


<details>
<summary>Click to expand code</summary>

```python
log_scores = np.log(df1.loc[:, ['math_score', 'english_score']])
</details>



<details>
<summary>Click to expand code</summary>

```python
log_scores
</details>


### <font color = 'dodgerblue'> **Modify/Create columns using apply() function**

* The apply function allows us to apply any function along an axis of the DataFrame.

In [None]:
# Use the apply function to subtract the mean from each column
# by default (axis=0), apply passes each column to the function in turn
# we can use a lambda function to specify the function


<details>
<summary>Click to expand code</summary>

```python
# Use the apply function to subtract the mean from each column
# by default (axis=0), apply passes each column to the function in turn
# we can use a lambda function to specify the function

df1[['math_score', 'english_score']] = df1[['math_score', 'english_score']].apply(lambda x : (x-x.mean()))
</details>



<details>
<summary>Click to expand code</summary>

```python
df1
</details>
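
The `apply()` call above works column by column because `axis=0` is the default. As a small sketch on made-up scores (the frame `scores` is an assumption for illustration), the same method can also work row by row with `axis=1`:

```python
import pandas as pd

# Hypothetical scores, for illustration only
scores = pd.DataFrame({'math_score': [20, 60, 50],
                       'english_score': [30, 50, 60]})

# axis=0 (default): the function receives one column at a time
col_range = scores.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
best_subject = scores.apply(lambda row: row.idxmax(), axis=1)

print(col_range)
print(best_subject)
```

`col_range` has one entry per column, while `best_subject` has one entry per row.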


In [None]:
# create a new dataframe


<details>
<summary>Click to expand code</summary>

```python
# create a new dataframe
my_arr = {'review_id' :[10, 45, 26, 36, 78, 46],
          'text' :['MOVIE was good', 'it was AMAZING', 'horible', 'EXCELLENT', 'WIll watch again', 'DO not waste time']
          }
df_review = pd.DataFrame(data = my_arr)
print(f"the new dataframe is \n {df_review}")
</details>


In [None]:
# use the apply function to create a new column with lowercase text


<details>
<summary>Click to expand code</summary>

```python
# use the apply function to create a new column with lowercase text

df_review['text_lower'] = df_review['text'].apply(lambda x: x.lower())
df_review
</details>


### <font color = 'dodgerblue'> **Use filter() function to select subset of the data**
filter() selects rows or columns of the dataframe according to their labels (index labels or column names). It does not filter on the values stored in the data.

In [None]:
# select columns by column names


<details>
<summary>Click to expand code</summary>

```python
# select columns by column names
df1.filter(items = ['math_score', 'english_score'])
</details>


In [None]:
# select columns by regular expression


<details>
<summary>Click to expand code</summary>

```python
# select columns by regular expression
df1.filter(regex = '.*score', axis =1)
</details>
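
Besides `items` and `regex`, `filter()` also accepts a `like` argument for simple substring matching, and `axis=0` to select rows by their labels. A minimal sketch on a made-up frame (`frame` and its labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'math_score': [20, 30],
                      'english_score': [30, 50],
                      'age': [4, 7]},
                     index=['alex', 'ami'])

by_substring = frame.filter(like='score')            # column labels containing 'score'
by_regex = frame.filter(regex='^math')               # column labels starting with 'math'
rows_by_label = frame.filter(items=['ami'], axis=0)  # row labels, not row values

print(by_substring)
print(by_regex)
print(rows_by_label)
```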


### <font color = 'dodgerblue'> **Modify/create columns using Binary operations**

In [None]:
# add a new column which gives the total score (math score + english score)
# The modified  dataframe is 


<details>
<summary>Click to expand code</summary>

```python
# add a new column which gives the total score (math score + english score)
df1['total_score'] = df1['math_score'] + df1['english_score']
# The modified  dataframe is 
print(f"The modified dataframe is :\n {df1}")
</details>


In [None]:
# we can achieve this using the eval function as well
# the eval function can evaluate mathematical expressions written as a string
# using dataframe column names


<details>
<summary>Click to expand code</summary>

```python
# we can achieve this using the eval function as well
# the eval function can evaluate mathematical expressions written as a string
# using dataframe column names
df1['total_score'] = df1.eval('math_score + english_score')
</details>



<details>
<summary>Click to expand code</summary>

```python
df1
</details>


In [None]:
# operation between two series with different index


<details>
<summary>Click to expand code</summary>

```python
# operation between two series with different index
my_ser1 = pd.Series(data = [1, 2, 3, 4], index = ['A', 'B','C', 'D'])
my_ser2 = pd.Series(data = [5, 6, 7, 8], index = ['A', 'B','C', 'F'])
</details>



<details>
<summary>Click to expand code</summary>

```python
my_ser1.add(my_ser2)
</details>


<font color = 'red'> **Note:** </font>
<font color = 'aqua'>
As we can see from the operation above, when we add Series or DataFrames, Pandas aligns their indices before performing the binary operation. The resulting index is the union of the two indices, and a label that is present in only one operand (here `D` and `F`) gets `NaN`.
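
If the `NaN` values for unmatched labels are not wanted, the `add()` method takes a `fill_value` argument. The sketch below repeats the two Series from the example (under the new names `s1` and `s2`) and compares both behaviours.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
s2 = pd.Series([5, 6, 7, 8], index=['A', 'B', 'C', 'F'])

plain = s1 + s2                    # labels D and F have no partner -> NaN
filled = s1.add(s2, fill_value=0)  # a missing partner is treated as 0

print(plain)
print(filled)
```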

In [None]:
# Operation with dataframes/series of different sizes
# similar to broadcasting in numpy


<details>
<summary>Click to expand code</summary>

```python
# Operation with dataframes/series of different sizes
# similar to broadcasting in numpy
data = np.arange(12).reshape(3, 4)
df = pd.DataFrame(data = data, columns = ['A', 'B','C', 'D'])
df
</details>


In [None]:
# subtracting row from a dataframe
# let us first select a row


<details>
<summary>Click to expand code</summary>

```python
# subtracting row from a dataframe
# let us first select a row
row = df.iloc[0, :]
row
</details>


In [None]:
# For rows, we have  to match the series and dataframes along columns
# hence in sub() we will specify axis = 1


<details>
<summary>Click to expand code</summary>

```python
# For rows, we have  to match the series and dataframes along columns
# hence in sub() we will specify axis = 1
df_row_sub  = df.sub(row, axis=1)
df_row_sub
</details>


In [None]:
# subtracting column from a dataframe
# let us select a column first


<details>
<summary>Click to expand code</summary>

```python
# subtracting column from a dataframe
# let us select a column first
col = df.iloc[:, 0]
col
</details>


In [None]:
# For columns we have  to match the series and dataframes along rows (index)
# hence in sub() we will specify axis = 0


<details>
<summary>Click to expand code</summary>

```python
# For columns we have  to match the series and dataframes along rows (index)
# hence in sub() we will specify axis = 0
df_col_sub  = df.sub(df.iloc[:, 0], axis=0)
df_col_sub
</details>
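
A common use of this pattern is centering data. As a sketch (the name `df_demo` is an assumption, the values match the example above), we subtract the column means with `axis=1` and the row means with `axis=0`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

# Series indexed by column labels -> align along the columns (axis=1)
col_centered = df_demo.sub(df_demo.mean(axis=0), axis=1)

# Series indexed by row labels -> align along the index (axis=0)
row_centered = df_demo.sub(df_demo.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```

The rule of thumb is that the `axis` argument of `sub()` names the axis whose labels the Series should be matched against.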


### <font color = 'dodgerblue'>**Deleting rows/columns**
We can delete the rows/columns by using the drop() method.

**Syntax**:


```
    DataFrame.drop(index,columns, inplace = True/False)
```
* `index` gives the label(s) of the row(s) to be deleted; `columns` gives the name(s) of the column(s) to be deleted.

* If we set inplace to **True**, the dataframe itself is modified and the method returns `None`.
* Otherwise (the default, inplace = **False**), the original dataframe is left unchanged and a new dataframe without the dropped rows/columns is returned.

In [None]:
# Deleting rows/columns
# lets consider the dataframe mentioned above


<details>
<summary>Click to expand code</summary>

```python
# Deleting rows/columns
# lets consider the dataframe mentioned above

df1_copy = df1.copy()
print(f"My dataframe is \n {df1_copy}")
</details>


In [None]:
# remove second row and column age


<details>
<summary>Click to expand code</summary>

```python
# remove second row and column age
df1_copy.drop(index = 1, columns ='age', inplace = True)
print(f"The new dataframe is \n {df1_copy}")
</details>


In [None]:
# remove math_score_column


<details>
<summary>Click to expand code</summary>

```python
# remove math_score_column
df1_copy.drop( columns ='math_score', inplace = True)
print(f"The new dataframe is \n {df1_copy}")
</details>
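
The difference between the two `inplace` settings can be checked directly. A minimal sketch on a made-up frame (`frame` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'name': ['alex', 'ami', 'ross'],
                      'age': [4, 7, 8]})

# inplace=False is the default: a new dataframe is returned
trimmed = frame.drop(index=1, columns='age')
print(frame.shape)    # the original still has all rows and columns
print(trimmed.shape)

# inplace=True modifies frame itself and returns None
result = frame.drop(columns='age', inplace=True)
print(result)
print(frame.columns.tolist())
```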


### <font color = 'dodgerblue'> **Truncating data before or after a specified index**
truncate() keeps only the rows whose index labels lie between the before and after values (both inclusive) and drops the rest.

**Syntax** :


```
DataFrame.truncate(before, after, axis)
```
where before and after are the index labels at which to cut, and axis selects whether rows or columns are truncated.



<details>
<summary>Click to expand code</summary>

```python
print(f"My dataframe is :\n {df1}")
</details>


In [None]:
# keep only the rows with index labels 1 to 2 (rows before 1 and after 2 are removed)


<details>
<summary>Click to expand code</summary>

```python
# keep only the rows with index labels 1 to 2 (rows before 1 and after 2 are removed)

df1_trunc = df1.truncate(before = 1, after = 2)
print(f"\n My new dataframe is \n {df1_trunc}")

</details>
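
To see that both bounds are inclusive and that each bound is optional, here is a small sketch on a made-up frame (`frame` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame with index labels 0..4, for illustration only
frame = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

middle = frame.truncate(before=1, after=3)  # keeps labels 1, 2 and 3
head_only = frame.truncate(after=1)         # keeps labels 0 and 1

print(middle)
print(head_only)
```

truncate() expects a sorted index, so it is best applied before any reordering of the rows.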


## <font color = 'dodgerblue'>**Sorting Data**


<details>
<summary>Click to expand code</summary>

```python
df1
</details>


In [None]:
# row sort (based on index)


<details>
<summary>Click to expand code</summary>

```python
# row sort (based on index)
df1.sort_index(ascending=False, axis =0)
</details>


In [None]:
# column sort (based on column names)


<details>
<summary>Click to expand code</summary>

```python
# column sort (based on column names)
df1.sort_index( axis =1)
</details>


In [None]:
# sort based on values in a column


<details>
<summary>Click to expand code</summary>

```python
# sort based on values in a column
df1.sort_values(by = 'age', ascending=False, inplace = True)
</details>



<details>
<summary>Click to expand code</summary>

```python
df1
</details>
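
sort_values() can also sort by several columns at once, with a separate direction for each. A minimal sketch on made-up data (`frame` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'class': ['second', 'first', 'second', 'first'],
                      'age': [8, 7, 9, 4]})

# sort by class (ascending), then by age within each class (descending)
ordered = frame.sort_values(by=['class', 'age'], ascending=[True, False])

# the old index labels travel with the rows; reset_index renumbers them
ordered_reset = ordered.reset_index(drop=True)

print(ordered)
print(ordered_reset)
```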


## <font color = 'dodgerblue'> **Aggregation**

* Similar to numpy, we can use aggregate functions in Pandas as well.


<img src = "https://drive.google.com/uc?view=export&id=1DXhiBZlNvBs6-ltdFXSSWwpudXTvMXaV" width ="400" />


###  <font color = 'dodgerblue'> **count()**

In [None]:
# Let us  use the dataframe from the above example


<details>
<summary>Click to expand code</summary>

```python
# Let us  use the dataframe from the above example
print(f"the dataframe is:\n {df1}")
</details>


In [None]:
# count() function
# lets count the number of non-missing (non-NA) values in the dataframe
# We will get the count for each column as follows


<details>
<summary>Click to expand code</summary>

```python
# count() function
# lets count the number of non-missing (non-NA) values in the dataframe
# We will get the count for each column as follows
df1.count()
</details>


###  <font color = 'dodgerblue'> **head(), tail()**

In [None]:
# head(),tail()
# get the top 2 rows


<details>
<summary>Click to expand code</summary>

```python
# head(),tail()
# get the top 2 rows
print(f"the top 2 rows are \n {df1.head(2)}")
</details>


In [None]:
# gets the last 3 rows


<details>
<summary>Click to expand code</summary>

```python
# gets the last 3 rows
print(f"\nthe last 3 rows are \n {df1.tail(3)}")
</details>


###  <font color = 'dodgerblue'> **mean/median/std/var etc**

In [None]:
# mean, median,std,var methods
# it gives the mean  and median for each column if they are numerical


<details>
<summary>Click to expand code</summary>

```python
# mean, median,std,var methods
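# numeric_only=True restricts each statistic to the numerical columns
# (without it, recent pandas versions raise an error on text columns such as name)
print(f"the mean of the data is \n{df1.mean(numeric_only=True)}")
print(f"\nthe median of the data is \n{df1.median(numeric_only=True)}")
print(f"\nthe standard deviation of the data is \n{df1.std(numeric_only=True)}")
print(f"\nthe variance of the data is \n{df1.var(numeric_only=True)}")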
</details>


###  <font color = 'dodgerblue'> **describe()**

In [None]:
# We can also use describe() method to get all the descriptive statistics


<details>
<summary>Click to expand code</summary>

```python
# We can also use describe() method to get all the descriptive statistics
print(f"the descriptive statistics  of the data is \n {df1.describe()}")
</details>


###  <font color = 'dodgerblue'> **info()**

In [None]:
# we can get the information about the columns using df.info() method


<details>
<summary>Click to expand code</summary>

```python
# we can get the information about the columns using df.info() method
df1.info()
</details>


###  <font color = 'dodgerblue'> **Applying functions along rows or columns**


<details>
<summary>Click to expand code</summary>

```python
df_score = df1.loc[:, ['math_score','english_score']]
df_score
</details>


In [None]:
# average score of each subject 
# axis = 0 : aggregate over the rows, giving one mean per column (subject)


<details>
<summary>Click to expand code</summary>

```python
# average score of each subject 
# axis = 0 : aggregate over the rows, giving one mean per column (subject)
df_score.mean(axis =0)
</details>


In [None]:
# total score for each student
# axis = 1 - column sum for each row (student)


<details>
<summary>Click to expand code</summary>

```python
# total score for each student
# axis = 1 - column sum for each row (student)
df_score.sum(axis=1)
</details>


In [None]:
# total score (sum of all the scores for all students)


<details>
<summary>Click to expand code</summary>

```python
# total score (sum of all the scores for all students)
df_score.sum().sum()
</details>


###  <font color = 'dodgerblue'> **value_counts(): count of unique values**

In [None]:
# Find number of males and females in the data
# value_counts gives the count of each unique value
# the gender column has two unique values - Male and Female
# value_counts will give the count of females and males


<details>
<summary>Click to expand code</summary>

```python
# Find number of males and females in the data
# value_counts gives the count of each unique value
# the gender column has two unique values - Male and Female
# value_counts will give the count of females and males
df1['gender'].value_counts()
</details>


In [None]:
# we can get the proportion (a fraction between 0 and 1) of each category by passing normalize = True


<details>
<summary>Click to expand code</summary>

```python
# we can get the proportion (a fraction between 0 and 1) of each category by passing normalize = True
df1['gender'].value_counts(normalize = True)
</details>
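
Since `normalize = True` returns fractions rather than percentages, we multiply by 100 when percentages are needed. A small sketch using the gender values from the example as a standalone Series (the variable names are assumptions for illustration):

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Female', 'Female', 'Male'])

counts = gender.value_counts()                # number of rows per category
shares = gender.value_counts(normalize=True)  # fractions that sum to 1
percentages = shares * 100

print(counts)
print(percentages)
```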


## <font color = 'dodgerblue'> **Grouping data**</font>
1. Aggregate functions help us understand the data.
2. But sometimes we need to apply these aggregate functions to groups within the data.
3. Groupby refers to a process involving one or more of the following steps:

  * Split : We split the data into groups by applying some condition on the data.

  * Apply : We apply a function (for example an aggregation) to each group independently.

  * Combine : We combine the results from the different groups.

**Syntax** :



```
  DataFrame.groupby().sum()
```
Here, we first group the data by the key given inside the parentheses (for example, a column name).

Then the sum is computed within each group.


Visual representation of a groupby operation
<img src="https://drive.google.com/uc?export=view&id=10khD_6G02Z624KRjhyCGuicodjhHRege" width="800"/>


<details>
<summary>Click to expand code</summary>

```python
df1
</details>


In [None]:
# group by function on the dataframe
# lets group students based on gender


<details>
<summary>Click to expand code</summary>

```python
# group by function on the dataframe
# lets group students based on gender
df1.groupby('gender')
</details>


In [None]:
# Now lets apply the mean() function to the grouped data
# It will return the mean of the items for each gender


<details>
<summary>Click to expand code</summary>

```python
# Now lets apply the mean() function to the grouped data
# It will return the mean of each numerical column for each gender
# (numeric_only=True skips text columns such as name and class)

df1.groupby('gender').mean(numeric_only=True)
</details>
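
To make the split, apply and combine steps concrete, the sketch below carries them out by hand on a made-up frame (`frame` and its values are assumptions for illustration) and compares the result with the one-line `groupby` expression.

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male'],
                      'total_score': [50, 60, 80, 30]})

# Split: iterating over a groupby yields one (key, sub-frame) pair per group
# Apply: take the mean of total_score within each sub-frame
pieces = {key: group['total_score'].mean()
          for key, group in frame.groupby('gender')}

# Combine: collect the per-group results into a single Series
manual = pd.Series(pieces, name='total_score')

# The usual groupby expression performs all three steps at once
direct = frame.groupby('gender')['total_score'].mean()

print(manual)
print(direct)
```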


#### <FONT COLOR = 'dodgerblue'> **groupby with aggregate()**

* We are already familiar with the aggregations such as sum(), median(), std() etc.
* With groupby we can also use aggregate(), which gives us the flexibility to apply one or several aggregations to each group at once.


**Syntax** :


```
  DataFrame.aggregate(func, axis)

```
We pass the function(s) to apply and, optionally, the axis along which to apply them.



<details>
<summary>Click to expand code</summary>

```python
print(f"My dataframe is: \n {df1}")
</details>


In [None]:
# lets group the data then apply aggregate function
# we want minimum and maximum value of each column for males and females


<details>
<summary>Click to expand code</summary>

```python
# lets group the data then apply aggregate function
# we want minimum and maximum value of each column for males and females
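# function names are passed as strings; passing the builtins min and max
# triggers a deprecation warning in recent pandas versions
df1.groupby('gender').aggregate(['min', 'max'])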
</details>
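
When different columns need different aggregations, `agg()` (an alias of `aggregate()`) also accepts named aggregations of the form `new_name=(column, function)`. A minimal sketch on made-up data (`frame` and the output column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male'],
                      'math_score': [20, 20, 30, 10],
                      'age': [4, 7, 8, 5]})

summary = frame.groupby('gender').agg(
    math_min=('math_score', 'min'),
    math_max=('math_score', 'max'),
    mean_age=('age', 'mean'),
)

print(summary)
```

This gives flat, readable column names instead of the two-level columns produced by passing a list of functions.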


#### <FONT COLOR = 'dodgerblue'> **filter out groups using groupby and filter**

We can use filter() with groupby() as well. This allows us to keep or drop whole groups based on a property of each group.

**Example** :
We want to keep all the groups for which the maximum value of total_score is greater than 0.


In [None]:
# lets create a function
# now pass the function in filter function
# code here


<details>
<summary>Click to expand code</summary>

```python
# lets create a function

def filter_func(df):
  return df['total_score'].max() > 0

# now pass the function to the filter function
df1.groupby('gender').filter(filter_func)
</details>


In [None]:
# use a lambda expression to get the same result


<details>
<summary>Click to expand code</summary>

```python
# use a lambda expression to get the same result
df1.groupby('gender').filter(lambda x: x['total_score'].max()>0)
</details>


<font color = 'red'> **Note:** </font> <font color = 'aqua'>   We get all the observations of every group whose maximum value of total_score is greater than 0. The filter is applied at the group level, so a group is kept or dropped as a whole.
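
The group-level behaviour is easiest to see on data where one group fails the condition. The sketch below uses a made-up frame and a made-up threshold of 70 (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male'],
                      'total_score': [50, 60, 80, 30]})

# keep only the groups whose best total_score exceeds 70
kept = frame.groupby('gender').filter(lambda g: g['total_score'].max() > 70)

print(kept)
```

Both Female rows are kept, including the one with a score of 60, because the condition is evaluated once per group and not once per row.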

## <font color = 'dodgerblue'> **Pivot Tables**
Pandas supports spreadsheet-like pivot tables that allow quick data summarization.

In [None]:
# let us get the mean  of total_score for each gender using groupby


<details>
<summary>Click to expand code</summary>

```python
# let us get the mean  of total_score for each gender using groupby
df1.groupby('gender')[['total_score']].mean()
</details>


In [None]:
# let us get the mean  of total_score for each gender in each class using groupby


<details>
<summary>Click to expand code</summary>

```python
# let us get the mean  of total_score for each gender in each class using groupby
df1.groupby(['gender','class'])[['total_score']].mean()
</details>


In [None]:
# we can unstack the above results


<details>
<summary>Click to expand code</summary>

```python
# we can unstack the above results
df1.groupby(['gender','class'])[['total_score']].mean().unstack()
</details>


In [None]:
# This could be done more easily using pivot tables


<details>
<summary>Click to expand code</summary>

```python
# This could be done more easily using pivot tables
df1.pivot_table('total_score', index ='gender', columns = 'class')
</details>


In [None]:
# the default aggregate function is mean, we can easily change this to any other function


<details>
<summary>Click to expand code</summary>

```python
# the default aggregate function is mean, we can easily change this to any other function
df1.pivot_table('total_score', index ='gender', columns = 'class', aggfunc='min')
</details>
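
pivot_table() can also add row and column totals through the `margins` argument. A minimal sketch on made-up data (`frame` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Female', 'Male'],
                      'class': ['first', 'first', 'second', 'second', 'second'],
                      'total_score': [50, 60, 80, 110, 30]})

# margins=True adds an 'All' row and column computed with the same aggfunc
table = frame.pivot_table(values='total_score', index='gender',
                          columns='class', aggfunc='mean', margins=True)

print(table)
```

With `aggfunc='mean'`, the `All` entries are means over the underlying rows, not means of the cell means.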


## <font color = 'dodgerblue'> **Combine Datasets: Merge and Join**
With Pandas we can use SQL-like joins on DataFrames. Pandas supports various types of joins: inner joins, left/right outer joins and full outer joins. We illustrate these below:

In [None]:
# Create dataframe


<details>
<summary>Click to expand code</summary>

```python
import pandas as pd
# Create dataframe
data = {'user_id':[1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
       'item_id': [1, 2, 3, 1, 2, 4, 5, 6, 7, 8],
       'rating':  [1, 1, 5, 5, 4, 4, 4, 3, 2, 1]}
ratings = pd.DataFrame(data)

title_names = {'item_id': [2, 3, 4, 5, 6, 7, 8, 9, 10],
          'movie_names': ' B C D E F G H I J '.split(),
          }
          
movie_title = pd.DataFrame(title_names)

print(f'Ratings data\n{ratings}')
print(f'\nMovie Titles\n{movie_title}')
</details>


### <font color = 'dodgerblue'> **Merge-INNER JOIN**

In [None]:
# join data sets using merge


<details>
<summary>Click to expand code</summary>

```python
# join data sets using merge
pd.merge(left=ratings, right=movie_title, on="item_id")
</details>


<font color = 'red'> **Note:** </font> <font color = 'aqua'>  Item_id 1, 9 and 10 were dropped because they do not exist in **both `DataFrame`s**: item_id 1 is missing from movie_title, and item_id 9 and 10 are not present in the ratings dataframe. This is the equivalent of a SQL `INNER JOIN`. If we want a `FULL OUTER JOIN`, where no item_id is dropped and `NaN` values are added, we should specify `how="outer"`:

### <font color = 'dodgerblue'> **Merge-OUTER JOIN**


<details>
<summary>Click to expand code</summary>

```python
all_titles = pd.merge(left=ratings, right=movie_title, on="item_id", how = 'outer')
all_titles
</details>


We can also perform a LEFT OUTER JOIN by setting how="left": only the item_ids present in the left DataFrame (ratings) appear in the result. Similarly, with how="right" only the item_ids present in the right DataFrame (movie_title) appear in the result. Let us see an example of each:


<details>
<summary>Click to expand code</summary>

```python
movies_with_ratings_only = pd.merge(left=ratings, right=movie_title, on="item_id", how = 'left')
movies_with_ratings_only
</details>



<details>
<summary>Click to expand code</summary>

```python
movies_with_title_info_only = pd.merge(left=ratings, right=movie_title, on="item_id", how = 'right')
movies_with_title_info_only 
</details>
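
To check where each row of a join came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two small made-up frames (the shortened `ratings` and `titles` here are assumptions for illustration, not the full data above).

```python
import pandas as pd

# Hypothetical, shortened frames, for illustration only
ratings = pd.DataFrame({'item_id': [1, 2, 3], 'rating': [5, 4, 3]})
titles = pd.DataFrame({'item_id': [2, 3, 4], 'movie_names': ['B', 'C', 'D']})

# _merge is 'left_only', 'right_only' or 'both' for every row of the result
merged = pd.merge(ratings, titles, on='item_id', how='outer', indicator=True)

print(merged)
```

Rows marked `both` are exactly the rows an inner join would return.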


### <font color = 'dodgerblue'> **Merge - key column names are different**


<details>
<summary>Click to expand code</summary>

```python
movie_title_2 = movie_title.copy()
movie_title_2.columns = ["movie_id", "movie_names"]
movie_title_2
</details>


If the key column names differ, then we have to use left_on and right_on. For example:


<details>
<summary>Click to expand code</summary>

```python
pd.merge(left=ratings, right=movie_title_2, left_on="item_id", right_on="movie_id")
</details>


## <font color = 'dodgerblue'>**Combine Datasets: Concat and Append**

Rather than joining DataFrames, we can just concatenate them using concat():


<details>
<summary>Click to expand code</summary>

```python
result_concat = pd.concat([ratings, movie_title])
result_concat
</details>


We can concatenate DataFrames horizontally instead of vertically by setting axis=1:


<details>
<summary>Click to expand code</summary>

```python
horizontal_concat = pd.concat([ratings, movie_title], axis =1)
horizontal_concat
</details>


However, this result does not make sense: the concatenation aligned the rows on their index values, not on item_id, so the two item_id columns in a row do not match and ratings are paired with the wrong movie names. So we need to index the `DataFrame`s by item_id before concatenating:


<details>
<summary>Click to expand code</summary>

```python
horizontal_concat = pd.concat([ratings.set_index('item_id'), movie_title.set_index('item_id')], axis =1)
</details>


We get an error above because item_id is not unique in the ratings dataframe.

In [None]:
# Create a new dataframe which has the average rating for each movie


<details>
<summary>Click to expand code</summary>

```python
# Create a new dataframe which has the average rating for each movie
average_ratings = pd.DataFrame(ratings.groupby('item_id')['rating'].mean())
average_ratings
</details>



<details>
<summary>Click to expand code</summary>

```python
horizontal_concat = pd.concat([average_ratings, movie_title.set_index('item_id')], axis =1)
horizontal_concat
</details>
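
For stacking frames with the same columns on top of each other (what `DataFrame.append` used to do before it was removed in pandas 2.0), `pd.concat()` is the tool to use. A minimal sketch on two made-up frames (`part1` and `part2` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames with identical columns, for illustration only
part1 = pd.DataFrame({'item_id': [1, 2], 'rating': [5, 4]})
part2 = pd.DataFrame({'item_id': [3, 4], 'rating': [1, 2]})

stacked = pd.concat([part1, part2])                        # original index labels are kept
renumbered = pd.concat([part1, part2], ignore_index=True)  # fresh index 0..n-1
labelled = pd.concat([part1, part2], keys=['old', 'new'])  # outer index level records the source

print(stacked)
print(renumbered)
print(labelled)
```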


# <font color = 'dodgerblue'> **Saving and Loading Data**


<details>
<summary>Click to expand code</summary>

```python
df1
</details>


## <font color = 'dodgerblue'> **Saving Data**

In [None]:
# mount google drive
# so that we can save and load models/data from google drive


<details>
<summary>Click to expand code</summary>

```python
# mount google drive
# so that we can save and load models/data from google drive
from google.colab import drive
drive.mount('/content/drive')
</details>


In [None]:
# library to navigate file system


<details>
<summary>Click to expand code</summary>

```python
# library to navigate file system
from pathlib import Path
</details>


In [None]:
# check current directory


<details>
<summary>Click to expand code</summary>

```python
# check current directory
!pwd
</details>


If we do not provide a specific location, files are saved in the current directory. When we close Google Colab, these files will be lost. Thus we should save files/models on Google drive.

In [None]:
# here I save my data to a folder in my Google drive. /content/drive/MyDrive refers to 
# the path where Google drive is mounted. /teaching_fall_2022 is a folder in my Google drive. 
# You should change this to the folder in your Google drive where you want to save data
# specify pathlib folder
# This is a system Path(PosixPath)


<details>
<summary>Click to expand code</summary>

```python
# here I save my data to a folder in my Google drive. /content/drive/MyDrive refers to 
# the path where Google drive is mounted. /teaching_fall_2022 is a folder in my Google drive. 
# You should change this to the folder in your Google drive where you want to save data

# specify pathlib folder
# This is a system Path(PosixPath)

data_folder = Path('/content/drive/MyDrive/teaching_fall_2022/ml-fall-2022/Lecture2_Numpy_Pandas')
</details>


We can construct a path to the file by joining the parts with the special operator /. The / operator can join several paths, or a mix of paths and strings, provided at least one of the operands is an instance of the Path class from the pathlib library (as shown below).


<details>
<summary>Click to expand code</summary>

```python
file_csv = data_folder / "df_pd_ml_21.csv"
file_json = data_folder / "df_pd_ml_21.json"
df1.to_csv(file_csv)
df1.to_json(file_json)
</details>
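
The same save-and-load round trip can be tried without Google drive. The sketch below uses a temporary directory as a stand-in for the drive folder (the folder name `data`, the file name `frame.csv` and the frame itself are assumptions for illustration).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'name': ['alex', 'ami'], 'age': [4, 7]})

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / 'data'                 # Path / str gives a new Path
    folder.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    csv_path = folder / 'frame.csv'

    frame.to_csv(csv_path)                      # the index is written as the first column
    file_existed = csv_path.exists()
    loaded = pd.read_csv(csv_path, index_col=0) # read the first column back as the index

print(loaded)
```

Passing `index_col=0` when reading avoids an extra unnamed column that would otherwise hold the saved index.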


In [None]:
# check the content of the saved file


<details>
<summary>Click to expand code</summary>

```python
# check the content of the saved file
for filename in (file_csv, file_json):
    print("#", filename)
    with open(filename, "rt") as f:
        print(f.read())
        print()
</details>


## <font color = 'dodgerblue'> **Loading Data**


<details>
<summary>Click to expand code</summary>

```python
df1_csv_loaded = pd.read_csv(file_csv, index_col=0)
</details>



<details>
<summary>Click to expand code</summary>

```python
df1_csv_loaded
</details>



<details>
<summary>Click to expand code</summary>

```python
df1_json_loaded = pd.read_json(file_json)
</details>



<details>
<summary>Click to expand code</summary>

```python
df1_json_loaded
</details>
