# Introduction to Pandas

![alt text](http://i1.wp.com/blog.adeel.io/wp-content/uploads/2016/11/pandas1.png?zoom=1.25&fit=818%2C163)

Pandas is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools. You can think of pandas as an extremely powerful version of Excel with many more features.

## **About iPython Notebooks**

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. You only need to write code between the ### START CODE HERE ### and ### END CODE HERE ### comments. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run Cell" (denoted by a play symbol) in the left bar of the cell.

**In this notebook you will learn about:**

* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining and Concatenating
* Operations
* Data Input and Output

## Importing Pandas
To import Pandas simply write the following:

In [0]:
import numpy as np
import pandas as pd

Consider an online survey about a product. People often do not share all of the information they are asked for. Some share their experience with the product but not how long they have been using it; others share how long they have used it and their experience, but not their contact information. One way or another, part of the data is almost always missing, and this is very common in real-world datasets.

# Missing Data

Let's show a few convenient methods to deal with Missing Data in pandas:

The reindexing example below creates a DataFrame with missing values. In the output, NaN (Not a Number) marks a missing value.


In [0]:
import pandas as pd
import numpy as np

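# 5 rows of random data, labelled a, c, e, f and h
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindex with the extra labels b, d and g; the new rows are filled with NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)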

        one       two     three
a -1.140865  0.658151  0.553755
b       NaN       NaN       NaN
c  0.517047  0.754190  1.248444
d       NaN       NaN       NaN
e  0.633126  0.672255  0.328480
f -0.511842  1.387641  0.394789
g       NaN       NaN       NaN
h  0.910683 -0.318443  0.834151
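
The effect of reindexing is easier to see on a smaller object. The sketch below is an illustration, not part of the original notebook: the Series `s` and its values are made up. Labels that did not exist before the reindex receive NaN.

```python
import pandas as pd

# A small Series with three labelled values (illustrative data).
s = pd.Series([10.0, 20.0, 30.0], index=['a', 'c', 'e'])

# reindex() conforms the Series to the new labels. 'b' and 'd' are new,
# so pandas has no value for them and inserts NaN.
s2 = s.reindex(['a', 'b', 'c', 'd', 'e'])
print(s2)

# reindex() returns a new object; the original Series is unchanged.
print(len(s), len(s2))  # 3 5
```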


## Check for missing data

To make detecting missing values easier (and consistent across different array dtypes), pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.

In [0]:
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].isnull())

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool


In [0]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].notnull())

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
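
The two checks are exact complements, and both exist as top-level functions as well as methods. The following sketch uses made-up Series (`values` and `labels` are illustrative names) to show this, including a non-numeric dtype.

```python
import numpy as np
import pandas as pd

# Both np.nan and None are treated as missing in a float Series.
values = pd.Series([1.0, np.nan, 3.0, None], index=['a', 'b', 'c', 'd'])

missing = values.isnull()    # True where a value is missing
present = values.notnull()   # True where a value is present

# notnull() is the element-wise negation of isnull().
print((present == ~missing).all())  # True

# The top-level function gives the same answer as the method.
print(pd.isnull(values).equals(missing))  # True

# The same check works for other dtypes, for example strings.
labels = pd.Series(['x', None])
print(labels.isnull().tolist())  # [False, True]
```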


**Exercise 3.1**

Check whether the following DataFrame contains missing data using the **isnull()** function.

In [0]:
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named `scores` to avoid shadowing the built-in `dict`)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(scores)

# using isnull() function
df.isnull()

   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
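
The boolean table returned by isnull() can be reduced to a single answer or to per-column counts. The sketch below rebuilds the same score data under the illustrative name `score_df` so that it runs on its own. It is one possible way to summarise the mask, not the only one.

```python
import numpy as np
import pandas as pd

# Same data as the exercise above, under an illustrative name.
score_df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                         'Second Score': [30, 45, 56, np.nan],
                         'Third Score': [np.nan, 40, 80, 98]})

mask = score_df.isnull()

# True if at least one cell anywhere in the DataFrame is missing.
has_missing = mask.values.any()
print(has_missing)  # True

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = mask.sum()
print(missing_per_column)

# any(axis=1) flags the rows that contain at least one missing value.
rows_with_missing = mask.any(axis=1)
print(rows_with_missing.tolist())  # [True, False, True, True]
```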


## Dropping missing data

In [0]:
# Drop every row that contains at least one missing value
df.dropna()

   First Score  Second Score  Third Score
1         90.0          45.0         40.0


In [0]:
# Drop every column that contains at least one missing value
df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


In [0]:
# Keep only the rows that have at least 2 non-missing values
df.dropna(thresh=2)

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0
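
The dropna() options are easier to compare on a DataFrame whose rows have different numbers of missing values. The sketch below uses a made-up DataFrame (`demo` and its columns `w`, `x`, `y`, `z` are illustrative) and also shows the `subset` parameter, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 0 is complete, row 2 has only one value.
demo = pd.DataFrame({'w': [1, 2, 3, 4],
                     'x': [1.0, np.nan, np.nan, 4.0],
                     'y': [1.0, 2.0, np.nan, 4.0],
                     'z': [1.0, 2.0, np.nan, np.nan]})

# thresh=N keeps the rows with at least N non-missing values.
at_least_two = demo.dropna(thresh=2)
at_least_four = demo.dropna(thresh=4)

# axis=1 drops columns instead of rows; only 'w' has no missing value.
complete_columns = demo.dropna(axis=1)

# subset restricts the check to the listed columns.
x_present = demo.dropna(subset=['x'])

print(at_least_two.index.tolist())       # [0, 1, 3]
print(at_least_four.index.tolist())      # [0]
print(complete_columns.columns.tolist()) # ['w']
print(x_present.index.tolist())          # [0, 3]
```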


**Exercise 3.2**

Drop the missing values in the given dataframe.

In [0]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})


### START CODE HERE ### 
# how='all' drops only rows in which every value is missing, so no row is removed here
df.dropna(how='all')
### END CODE HERE ### 


     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
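
To compare the two values of `how`, the sketch below applies both to the same data, rebuilt under the illustrative name `frame`. It also adds an empty row through reindexing so that how='all' has something to remove. Note that dropna() returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Same data as the exercise above, under an illustrative name.
frame = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [5, np.nan, np.nan],
                      'C': [1, 2, 3]})

# how='any' (the default) drops a row if any value is missing.
dropped_any = frame.dropna(how='any')
print(dropped_any.index.tolist())  # [0]

# how='all' drops a row only if every value is missing; none qualifies.
dropped_all = frame.dropna(how='all')
print(dropped_all.shape)  # (3, 3)

# Reindexing with a new label 3 adds a row that is entirely NaN,
# which how='all' then removes.
with_empty_row = frame.reindex([0, 1, 2, 3])
cleaned = with_empty_row.dropna(how='all')
print(cleaned.index.tolist())  # [0, 1, 2]

# The original DataFrame was never modified.
print(frame.shape)  # (3, 3)
```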


## Filling missing values

In [0]:
# Replace every missing value with a fixed value
df.fillna(value='FILL VALUE')

            A           B  C
0           1           5  1
1           2  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3
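
In numeric columns a numeric fill value is usually more practical than a string, because the column keeps its numeric dtype. The sketch below, which uses the illustrative name `frame` for the same data, fills with a single number and then with a different value per column by passing a dictionary.

```python
import numpy as np
import pandas as pd

# Same data as above, under an illustrative name.
frame = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [5, np.nan, np.nan],
                      'C': [1, 2, 3]})

# One value for every missing cell.
filled_zero = frame.fillna(0)
print(filled_zero['B'].tolist())  # [5.0, 0.0, 0.0]

# A dictionary maps column names to their own fill values.
filled_per_column = frame.fillna({'A': 0, 'B': -1})
print(filled_per_column['B'].tolist())  # [5.0, -1.0, -1.0]

# fillna() returns a new DataFrame; the original still has 3 missing cells.
print(int(frame.isnull().sum().sum()))  # 3
```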


## Finding mean value

You can also fill missing values with the mean of a row or a column.

In [0]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
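
The same idea extends from one column to the whole DataFrame. The sketch below is one possible approach, using the illustrative name `frame` for the same data: passing the Series of column means to fillna() fills each column with its own mean, and a row-wise apply() fills each row with that row's mean.

```python
import numpy as np
import pandas as pd

# Same data as above, under an illustrative name.
frame = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [5, np.nan, np.nan],
                      'C': [1, 2, 3]})

# frame.mean() is a Series indexed by column name, so each column
# is filled with its own mean (A: 1.5, B: 5.0).
column_filled = frame.fillna(frame.mean())
print(column_filled)

# Row-wise variant: fill each row's gaps with the mean of that row.
row_filled = frame.apply(lambda row: row.fillna(row.mean()), axis=1)
print(row_filled)
```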

**Exercise 3.3**

Find the mean of all the observations over the index axis (that is, the mean of each column).

In [0]:
# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 16, 7, 3, 8],
                   "D": [14, 3, 17, 2, 6]})

# Mean over the index axis (axis=0), which gives one mean per column
df.mean(axis=0)


A    13.2
B    13.2
C    10.8
D     8.4
dtype: float64
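
For comparison, axis=1 averages across the columns and gives one mean per row. The sketch below uses the same data under the illustrative name `frame`, and also shows that mean() skips missing values by default, which is why the mean-fill approach above works on columns that contain NaN.

```python
import numpy as np
import pandas as pd

# Same data as the exercise above, under an illustrative name.
frame = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                      "B": [5, 2, 54, 3, 2],
                      "C": [20, 16, 7, 3, 8],
                      "D": [14, 3, 17, 2, 6]})

# axis=0: one mean per column; axis=1: one mean per row.
column_means = frame.mean(axis=0)
row_means = frame.mean(axis=1)
print(row_means.tolist())  # [12.75, 6.25, 20.75, 13.0, 4.25]

# mean() ignores NaN by default (skipna=True).
values = pd.Series([1.0, 2.0, np.nan])
print(values.mean())              # 1.5
print(values.mean(skipna=False))  # nan
```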

# Great Job!

**Exercises**

1. Check whether a DataFrame has any missing values, using isnull().

In [0]:
#Enter Your Code Here 

2. Count the number of missing values in each column.


In [0]:
#Enter Your Code Here 

3. Replace the missing values in multiple numeric columns with the column mean.

In [0]:
#Enter Your Code Here 

4. Check whether a DataFrame has any missing values, using notnull().

In [0]:
#Enter Your Code Here 

5. Drop the missing values and print the shape of the new DataFrame.

In [0]:
#Enter Your Code Here 

6. Print the shape of the new DataFrame.

In [0]:
#Enter Your Code Here 

7. Print the shape of the original DataFrame.

In [0]:
#Enter Your Code Here 

8. Compute the sum of the missing values in each column.

In [0]:
#Enter Your Code Here 

9. Fill all the missing values with a specified value, with inplace set to False.

In [0]:
#Enter Your Code Here 

10. Check whether a specific column has missing values.

In [0]:
#Enter Your Code Here 

SOLUTIONS: [MISSING DATA](https://drive.google.com/open?id=11H_5g3inOFwns2axO8JcWCcTnaHtpMFg)