## Recommended Sites for Pandas

* [Datascience Made Simple](http://www.datasciencemadesimple.com)
* [Pandas tutorial by dataquest](https://www.dataquest.io/blog/pandas-python-tutorial/) [Blog]
* [Pandas official doc](http://pandas.pydata.org/pandas-docs/stable/) [Highly recommended]
* [Experiments with data](https://trainings.analyticsvidhya.com/courses/course-v1:AnalyticsVidhya+EWD01+2018_EWD_T1/about) [Course]

## Advanced Pandas

In this notebook, we will go through many different concepts that should help you see why pandas is so important for data analysis. You do not need to master all of them, but knowing that they exist is an advantage; you can always learn a concept in depth when the task at hand demands it. These are the concepts covered in this notebook:
* [Sorting with nlargest, nsmallest and sort_values](#Sorting with nlargest, nsmallest and sort_values)
* [Replacing values in dataframe/series](#Replacing values in dataframe/series)
* [Renaming columns and indexes in dataframe/series](#Renaming columns and indexes in dataframe/series)
* [Handling Missing Values](#Handling Missing Values)
* [Descriptive stats](#Descriptive stats)
* [Combining dataframes](#Combining dataframes)
* [String manipulations in pandas](#String manipulations in pandas)
* [Plotting in pandas](#Plotting in pandas)

<span style="color:brown">**Note:**</span> Not all concepts are covered completely. Links to the documentation are provided for most of them; do visit those links if you want to learn a concept in greater detail or master it (recommended!).

In [3]:
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir('drive/My Drive/courses/FML/2. Data Science/1. Data Analysis/2. Pandas')
!ls

Mounted at /content/drive
'1. pandas_intro.ipynb'
'2. pandas series.ipynb'
'3. Pandas DataFrame.ipynb'
'4. Selecting Subsets with [ ], .loc and .iloc.ipynb'
'5. Boolean Indexing.ipynb'
'6. Assigning subsets of data.ipynb'
'7. Other Important concepts in Pandas.ipynb'
'8. Groupby.ipynb'
'capstone projects'
 data
 images
'pandas from numpy.pptx'
'Pandas Solutions(Part 4-6).ipynb'
'samples codes'


In [4]:
import pandas as pd
df = pd.read_csv('data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


## Sorting with nlargest, nsmallest and sort_values

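### <span style="color:brown"><b>Note :</b></span> nlargest and nsmallest raise a TypeError if the dtype of the column is not supported (for example, a column of strings); sort_values fails only when the values in a column cannot be compared with one another.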

###  nlargest

~~~ python
    > df.nlargest(n,columns)
~~~

where,<br>
* **n** - the number of rows to return<br>
* **columns** - the column(s) to order by


In [5]:
df.nlargest(3,['height'])

Unnamed: 0,state,color,food,age,height,score
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Jane,NY,blue,Steak,30,165,4.6


###  nsmallest

~~~ python
    > df.nsmallest(n,columns)
~~~

where,<br>
* **n** - the number of rows to return<br>
* **columns** - the column(s) to order by; later columns are used to break ties


In [6]:
df.nsmallest(3,['height','age','score'])

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Aaron,FL,red,Mango,12,120,9.0
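
When several columns are passed, the later ones only matter for rows that tie on the earlier ones. The sketch below uses a small made-up frame (the `pets` data and its labels are hypothetical, not part of the sample file) to show the tie-breaking:

```python
import pandas as pd

# Hypothetical data: Rex and Tom have the same height
pets = pd.DataFrame({"height": [70, 70, 80], "age": [5, 2, 1]},
                    index=["Rex", "Tom", "Bo"])

# height is compared first; age decides the order of the tied rows
smallest_two = pets.nsmallest(2, ["height", "age"])
print(smallest_two)
```

Here Tom is listed before Rex because both have the same height and Tom has the smaller age.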


### sort_values

~~~ python
    > df.sort_values(by,ascending)
~~~

where,<br>
* **by** - the column name(s) to sort by<br>
* **ascending** - True/False (or a list with one value per column), default is *True*


In [7]:
df.sort_values('age', ascending=True).head(3)

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Aaron,FL,red,Mango,12,120,9.0
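
As a rough illustration of sorting on more than one key, the sketch below sorts by one column ascending and another descending. The `people` frame is a hypothetical stand-in for a few rows of the sample data, so the example runs without the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of sample_data.csv
people = pd.DataFrame(
    {"state": ["NY", "TX", "FL", "TX", "TX"],
     "age": [30, 2, 12, 33, 69],
     "height": [165, 70, 120, 172, 150]},
    index=["Jane", "Niko", "Aaron", "Christina", "Cornelia"],
)

# Sort by state (A to Z), then by age from oldest to youngest within a state
by_state_then_age = people.sort_values(by=["state", "age"],
                                       ascending=[True, False])
print(by_state_then_age)

# On a numeric column without ties, nlargest picks the same rows as
# sorting in descending order and taking the first n
top2_sorted = people.sort_values("height", ascending=False).head(2)
top2_nlargest = people.nlargest(2, "height")
print(top2_nlargest.equals(top2_sorted))
```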


## Replacing values in dataframe/series

General syntax for replace function is as follows:

~~~ python
    > df.replace(to_replace, value)
~~~

**to_replace** and **value** both can be *str, regex, list, dict, Series, int, float, or None* 

In [8]:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})
df

Unnamed: 0,A,B,C
0,0,5,a
1,1,6,b
2,2,7,c
3,3,8,d
4,4,9,e


In [9]:
df.replace(0, 5)

Unnamed: 0,A,B,C
0,5,5,a
1,1,6,b
2,2,7,c
3,3,8,d
4,4,9,e


### List-like `to_replace`

In [10]:
df.replace([0, 1, 2, 3], 4)

Unnamed: 0,A,B,C
0,4,5,a
1,4,6,b
2,4,7,c
3,4,8,d
4,4,9,e


In [11]:
df.replace([0, 1, 2, 3], [4, 3, 2, 1])

Unnamed: 0,A,B,C
0,4,5,a
1,3,6,b
2,2,7,c
3,1,8,d
4,4,9,e


### dict-like `to_replace`

In [12]:
df.replace({0: 10, 1: 100})

Unnamed: 0,A,B,C
0,10,5,a
1,100,6,b
2,2,7,c
3,3,8,d
4,4,9,e


In [13]:
df.replace({'A': 0, 'B': 5}, 100)

Unnamed: 0,A,B,C
0,100,100,a
1,1,6,b
2,2,7,c
3,3,8,d
4,4,9,e


In [14]:
df.replace({'A': {0: 100, 4: 400}})

Unnamed: 0,A,B,C
0,100,5,a
1,1,6,b
2,2,7,c
3,3,8,d
4,400,9,e
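
The general syntax above lists *regex* as an accepted type for `to_replace`. A minimal sketch of that case follows; the `colors` frame and the pattern are made up for illustration:

```python
import pandas as pd

# Hypothetical text data
colors = pd.DataFrame({"color": ["dark red", "light red", "blue"],
                       "code": ["r-01", "r-02", "b-01"]})

# With regex=True the pattern is matched inside each string and only the
# matching part is substituted
cleaned = colors.replace(r"^(dark|light) ", "", regex=True)
print(cleaned)

# Without regex=True, replace only matches whole cell values,
# so "dark red" and "light red" are left alone here
whole_value_only = colors.replace("red", "crimson")
print(whole_value_only)
```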


## Renaming columns and indexes in dataframe/series

General syntax for rename function is as follows:

    > df.rename(index, columns)

Both index and columns are *dict-like*; a function that maps an old label to a new one also works.

In [15]:
df = df.rename(columns={"A": "a", "B": "c"}, index={0:100})
df

Unnamed: 0,a,c,C
100,0,5,a
1,1,6,b
2,2,7,c
3,3,8,d
4,4,9,e


In [16]:
df.rename(index={0: "zero", "B": "c"})  # neither 0 nor "B" is an index label here, so nothing is renamed

Unnamed: 0,a,c,C
100,0,5,a
1,1,6,b
2,2,7,c
3,3,8,d
4,4,9,e
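
Besides dict-like mappers, `rename` also accepts a function that is applied to every label. The sketch below assumes a small made-up `frame`; it also shows that labels missing from a dict mapper are silently ignored, which is why the cell above leaves the frame unchanged:

```python
import pandas as pd

# Hypothetical frame with awkward column names
frame = pd.DataFrame({"first name": [1, 2], "last name": [3, 4]},
                     index=["x", "y"])

# A function is called once per label
renamed = frame.rename(columns=lambda c: c.replace(" ", "_"),
                       index=str.upper)
print(renamed)

# A key that matches no label is ignored by default
unchanged = frame.rename(index={"not_there": "z"})
print(unchanged.index.tolist())
```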


## Handling Missing Values

### fillna()

    > df.fillna(value)
   
*value - scalar, dict*

In [0]:
import numpy as np

In [19]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                    [3, 4, np.nan, 1],
                    [np.nan, np.nan, np.nan, 5],
                    [np.nan, 3, np.nan, 4]],
                    columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [0]:
df['A'] = df['A'].fillna(0)

In [21]:
df.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [22]:
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0
1,3.0,4.0,2.0,1
2,0.0,1.0,2.0,5
3,0.0,3.0,2.0,4
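
Apart from a scalar or dict, `fillna` also accepts a Series, which is aligned on column names. A common use, sketched here on a made-up `readings` frame, is filling each column with its own mean; `ffill()` is shown next to it as an alternative that carries the last seen value forward:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps
readings = pd.DataFrame({"temp": [20.0, np.nan, 24.0],
                         "humidity": [np.nan, 50.0, 70.0]})

# readings.mean() is a Series indexed by column name,
# so each column is filled with its own mean
filled_mean = readings.fillna(readings.mean())
print(filled_mean)

# Forward fill: a leading NaN stays NaN because there is nothing before it
filled_forward = readings.ffill()
print(filled_forward)
```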


### dropna

    > df.dropna(axis=0, how='any', thresh=None, subset=None)
    
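* **axis** - 0 to drop rows (default), 1 to drop columns.<br>
* **how** - {'all', 'any'}, default is 'any'. 'any' drops a row/column if any value is missing; 'all' drops it only if all values are missing.<br>
* **thresh** - the minimum number of non-NA values required to keep a row/column.<br>
* **subset** - labels along the other axis to consider, e.g. the columns to check when dropping rows.<br>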

In [23]:
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                             pd.NaT]})
df

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [24]:
df.dropna() # drops all the rows that have at least one element missing

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


In [25]:
df.dropna(axis=1) # drops all the columns that have at least one element missing

Unnamed: 0,name
0,Alfred
1,Batman
2,Catwoman


In [26]:
df.dropna(how='all')  #Drop the rows where all elements are missing.

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [0]:
df.dropna(thresh=2) #Keep only the rows with at least 2 non-NA values.

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [27]:
df.dropna(subset=['name', 'toy']) # Define in which columns to look for missing values.

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


### replace

You can also use replace() to handle missing values, for example to turn placeholder values into NaN or to swap NaN for a default value.
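
One way this can look in practice is sketched below. The `survey` frame and the placeholder values `-999` and `"unknown"` are assumptions made up for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data where missing answers were coded as placeholders
survey = pd.DataFrame({"age": [25, -999, 40],
                       "city": ["Pune", "unknown", "Delhi"]})

# Turn the placeholders into real missing values ...
as_missing = survey.replace([-999, "unknown"], np.nan)
print(as_missing)

# ... so that the usual tools (isna, fillna, dropna) work on them
print(as_missing.isna().sum())
complete_rows = as_missing.dropna()
```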

## Descriptive stats 

    > abs() - Return a Series/DataFrame with absolute numeric value of each element.
    > count() - Count non-NA cells for each column or row.
    > max() - returns the maximum of the values in the object.
    > min() - returns the minimum of the values in the object.
    > mean() - returns the mean of the values in the object.
    > median() - returns the median of the values in the object.
    > mode() - returns the mode of the values in the object.
    > sum() - returns the sum of the values in the object.

<span style="color:brown">**Note:**</span> all of these methods except abs() take *axis* as an argument.
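
To make the effect of *axis* concrete, here is a small sketch on a made-up `marks` frame (names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical marks: rows are students, columns are subjects
marks = pd.DataFrame({"maths": [80, 90], "physics": [70, 60]},
                     index=["Asha", "Ravi"])

per_subject = marks.mean()        # axis=0 (default): one value per column
per_student = marks.mean(axis=1)  # axis=1: one value per row

print(per_subject)
print(per_student)
```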

In [0]:
import pandas as pd
import numpy as np

In [29]:
df = pd.read_csv('data/sample_data.csv', index_col=0)
df = df[['age', 'height', 'score']].copy()
df

Unnamed: 0,age,height,score
Jane,30,165,4.6
Niko,2,70,8.3
Aaron,12,120,9.0
Penelope,4,80,3.3
Dean,32,180,1.8
Christina,33,172,9.5
Cornelia,69,150,2.2


In [30]:
df.sum()

age       182.0
height    937.0
score      38.7
dtype: float64

In [0]:
df.values.mean()

55.12857142857143

In [0]:
df['score'] = 5
df

In [0]:
df['height'].unique()

### value_counts()

    Returns object containing counts of unique values.
<span style="color:brown">**Note:**</span> value_counts() is primarily a Series method; DataFrame.value_counts() is available only from pandas 1.1 onwards and counts unique rows.
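
A minimal sketch of `value_counts()` on a Series, using made-up state codes:

```python
import pandas as pd

# Hypothetical column of state codes
states = pd.Series(["TX", "NY", "TX", "FL", "TX"])

counts = states.value_counts()                # most frequent value first
shares = states.value_counts(normalize=True)  # fractions instead of counts

print(counts)
print(shares)
```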

## Combining dataframes

Pandas provides various facilities for easily combining Series and DataFrame objects.<br>
<span style="color:brown">**Note:**</span> refer to the [Merge, join and concat documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html) for details.

### concat() - based on index / column labels

    > pd.concat(objs, axis=0, join='outer')
    
   * **axis** : {0, 1, …}, default 0. The axis to concatenate along.
   * **join** : {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on the other axis(es): outer takes the union, inner the intersection.
   * **join_axes** : list of Index objects giving the specific indexes to use for the other n - 1 axes instead of inner/outer set logic. This argument was removed in pandas 1.0; on newer versions, call `reindex` on the result instead.

In [0]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']},
                     index=[0, 1, 2, 3]) 

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                     'B': ['B4', 'B5', 'B6', 'B7'],
                     'C': ['C4', 'C5', 'C6', 'C7'],
                     'D': ['D4', 'D5', 'D6', 'D7']},
                      index=[4, 5, 6, 7])
 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                     'B': ['B8', 'B9', 'B10', 'B11'],
                     'C': ['C8', 'C9', 'C10', 'C11'],
                     'D': ['D8', 'D9', 'D10', 'D11']},
                     index=[8, 9, 10, 11])
 

In [32]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [33]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [34]:
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [39]:
result = pd.concat([df2,df1,df3], axis=1)
result
# axis=0: frames are stacked vertically, so the number of records (rows) increases
# axis=1: frames are placed side by side, so the number of columns increases

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,,,,,A0,B0,C0,D0,,,,
1,,,,,A1,B1,C1,D1,,,,
2,,,,,A2,B2,C2,D2,,,,
3,,,,,A3,B3,C3,D3,,,,
4,A4,B4,C4,D4,,,,,,,,
5,A5,B5,C5,D5,,,,,,,,
6,A6,B6,C6,D6,,,,,,,,
7,A7,B7,C7,D7,,,,,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9


In [37]:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                     'D': ['D2', 'D3', 'D6', 'D7'],
                     'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])

result2 = pd.concat([df1, df4], axis=1, join='inner')
result2

Unnamed: 0,A,B,C,D,B.1,D.1,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


![](images/merging_concat_axis1.png)

### append() 

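A useful shortcut to concat() is the append() instance method on Series and DataFrame.<br>
It appends the rows of `other` to the end of this frame and returns a new object. Columns not in this frame are added as new columns. Note that append() was deprecated in pandas 1.4 and removed in 2.0, so on recent versions use `pd.concat([df, df2])` instead.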
~~~ python
> df.append(df2)
~~~   
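
The row-wise appending described above can also be written with `pd.concat`, which works on every pandas version, including releases where `DataFrame.append` is no longer available. The `top` and `bottom` frames in this sketch are made up for illustration:

```python
import pandas as pd

# Hypothetical frames with partly different columns
top = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
bottom = pd.DataFrame({"A": ["A2"], "C": ["C2"]})

# Rows of bottom are added below top; columns missing on one side become NaN.
# ignore_index=True renumbers the rows 0..n-1 instead of keeping old labels.
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)

# join='inner' keeps only the columns that both frames share
common = pd.concat([top, bottom], join="inner", ignore_index=True)
print(common)
```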

### Merge and join - based on key values
#### Merge()
 Pandas provides join operations very similar to those of relational databases like SQL. *merge()* is a single function that serves as the entry point for all standard database join operations between DataFrame objects.
 ~~~ python
> pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
 ~~~
 * **how:** {'left', 'right', 'outer', 'inner'}. Defaults to inner.<br>
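
A small sketch of how the `how` argument changes the result. The `employees` and `departments` frames and the `dept_id` key are hypothetical:

```python
import pandas as pd

# Hypothetical tables sharing the key column dept_id
employees = pd.DataFrame({"name": ["Jane", "Niko", "Dean"],
                          "dept_id": [1, 2, 3]})
departments = pd.DataFrame({"dept_id": [1, 2, 4],
                            "dept": ["Sales", "Ops", "HR"]})

# inner (default): only keys present in both tables
inner = pd.merge(employees, departments, on="dept_id")

# left: every employee is kept; Dean has no matching department, so dept is NaN
left = pd.merge(employees, departments, how="left", on="dept_id")

# outer: union of the keys from both tables
outer = pd.merge(employees, departments, how="outer", on="dept_id")

print(inner, left, outer, sep="\n\n")
```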
 
#### Join()
Join columns with another DataFrame either on the index or on a key column. You can efficiently join multiple DataFrame objects by index at once by passing a list.
~~~ python
> DataFrame.join(other, on=None, how='left') 
~~~
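
A minimal sketch of an index-based join, using two made-up frames (`scores` and `ages`) that share some index labels:

```python
import pandas as pd

# Hypothetical frames indexed by person name
scores = pd.DataFrame({"score": [4.6, 8.3]}, index=["Jane", "Niko"])
ages = pd.DataFrame({"age": [30, 12]}, index=["Jane", "Aaron"])

# how='left' (default): keep every row of scores; Niko has no age, so NaN
joined_left = scores.join(ages)
print(joined_left)

# how='inner': keep only the labels found in both indexes
joined_inner = scores.join(ages, how="inner")
print(joined_inner)
```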


## String manipulations in pandas

Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. Pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

Pandas is extremely powerful when it comes to text data. Go through [the documentation](https://pandas.pydata.org/pandas-docs/stable/text.html) and you will see why!
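
As a small taste of the `.str` accessor, the sketch below cleans a made-up Series of food names. The data is hypothetical; note how the missing value passes through without raising an error:

```python
import numpy as np
import pandas as pd

# Hypothetical messy text column with a missing value
foods = pd.Series(["  Steak", "lamb ", np.nan, "Mango-ripe"])

# String methods are applied element by element; NaN stays NaN
tidy = foods.str.strip().str.lower()

# Regular expression match; na=False treats missing values as "no match"
starts_with_m = tidy.str.contains("^m", na=False)

# Split on "-" and keep the first piece of each string
first_part = tidy.str.split("-").str[0]

print(tidy, starts_with_m, first_part, sep="\n\n")
```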

## Plotting in pandas

Remember, pandas is a Python library for data manipulation and data analysis. Until now we have mostly looked at the manipulation part. You can also plot your pandas DataFrames and Series for visualization and analysis. Pandas has many powerful built-in functions for data visualization, and they are very easy to use.

Refer to [this documentation](https://pandas.pydata.org/pandas-docs/stable/visualization.html) to learn more about data visualization in pandas.
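
A minimal plotting sketch, assuming matplotlib is installed (pandas uses it as its default plotting backend). The `heights` frame is made up, and the `Agg` backend line is only there so the snippet also runs outside a notebook; inside a notebook it can be dropped:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import pandas as pd

# Hypothetical data: a few people with age and height
heights = pd.DataFrame({"age": [30, 2, 12], "height": [165, 70, 120]},
                       index=["Jane", "Niko", "Aaron"])

# Bar chart of one column; plot() returns a matplotlib Axes object
ax = heights.plot(kind="bar", y="height", title="Height by person")
ax.set_ylabel("height")

# Scatter plot of two columns against each other
scatter_ax = heights.plot.scatter(x="age", y="height")
```

Because `plot()` returns a regular matplotlib Axes, anything matplotlib offers (labels, limits, saving to a file) can be applied to the result.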