# Exploring Data using pandas

## 1. Loading Data

In [17]:
import pandas as pd
df = pd.read_csv("imdb1.csv")
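
The `Unnamed: 0` column that shows up in the outputs below is usually an index that an earlier `to_csv` call wrote into the file. A minimal sketch, assuming that is the case here, of reading such a column back as the index with `index_col=0`. The two-row in-memory CSV is a stand-in for `imdb1.csv`, built from titles and ratings shown in this notebook:

```python
import io
import pandas as pd

# Two-row stand-in for imdb1.csv; the empty first header field is what
# pandas would otherwise name "Unnamed: 0".
csv_text = ",title,rating\n0,The Shawshank Redemption,9.3\n1,The Dark Knight,9.1\n"

raw = pd.read_csv(io.StringIO(csv_text))                    # default behaviour
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)   # first column becomes the index

print(list(raw.columns))      # ['Unnamed: 0', 'title', 'rating']
print(list(indexed.columns))  # ['title', 'rating']
```

Whether to do this depends on whether the column carries any meaning beyond the row position.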

## 2. Understanding data

In [18]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
0,0,The Shawshank Redemption,Drama,142 min,9.3,2557442,$28.34M
1,1,The Dark Knight,Action| Crime| Drama,152 min,9.1,2515739,$534.86M
2,2,Inception,Action| Adventure| Sci-Fi,148 min,8.8,2245966,$292.58M
3,3,Fight Club,Drama,139 min,8.8,2013415,$37.03M
4,4,Forrest Gump,Drama| Romance,142 min,8.8,1973705,$330.25M


In [19]:
df.tail()

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
45,45,Star Wars: Episode VI - Return of the Jedi,Action| Adventure| Fantasy,131 min,8.3,1012709,$309.13M
46,46,Braveheart,Biography| Drama| History,178 min,8.4,1010028,$75.60M
47,47,Finding Nemo,Animation| Adventure| Comedy,100 min,8.2,1008792,$380.84M
48,48,Up,Animation| Adventure| Comedy,96 min,8.3,1004503,$293.00M
49,49,Reservoir Dogs,Crime| Drama| Thriller,99 min,8.3,982508,$2.83M


In [20]:
df.shape

(50, 7)

In [21]:
df.columns

Index(['Unnamed: 0', 'title', 'genres', 'runtime', 'rating', 'votes', 'gross'], dtype='object')

In [22]:
df.dtypes

Unnamed: 0      int64
title          object
genres         object
runtime        object
rating        float64
votes          object
gross          object
dtype: object
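
The `object` dtype on `runtime`, `votes` and `gross` suggests pandas read these columns as strings, so they are left out of numeric summaries. One possible way to convert them is sketched below on three rows copied from the `head()` output. The cleaning rules (stripping ` min`, `$`, `M` and any thousands separators) are assumptions based on the values shown, and the new column names are illustrative only:

```python
import pandas as pd

sample = pd.DataFrame({
    "runtime": ["142 min", "152 min", "148 min"],
    "votes": ["2557442", "2515739", "2245966"],
    "gross": ["$28.34M", "$534.86M", "$292.58M"],
})

# Strip the unit text, then convert. errors="coerce" turns anything that
# still cannot be parsed into NaN instead of raising an exception.
sample["runtime_min"] = pd.to_numeric(
    sample["runtime"].str.replace(" min", "", regex=False), errors="coerce"
)
# Remove thousands separators in case the raw file contains them.
sample["votes_num"] = pd.to_numeric(
    sample["votes"].str.replace(",", "", regex=False), errors="coerce"
)
sample["gross_musd"] = pd.to_numeric(
    sample["gross"].str.replace("$", "", regex=False).str.replace("M", "", regex=False),
    errors="coerce",
)

print(sample.dtypes)
```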

In [23]:
df.count()

Unnamed: 0    50
title         50
genres        48
runtime       50
rating        48
votes         49
gross         49
dtype: int64

## 3. Summarising and computing descriptive statistics

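Calling a DataFrame's `sum` method returns a Series containing the column sums. Passing `axis=1` instead sums across the columns of each row, giving one value per row. For string (`object`) columns, "summing" concatenates the values, as the output below shows.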


In [13]:
df.sum()

Unnamed: 0                                                 1225
title         The Shawshank RedemptionThe Dark KnightIncepti...
genres        Drama            Action| Crime| Drama         ...
runtime       142 min152 min148 min139 min142 min154 min136 ...
rating                                                    426.5
votes         2,557,4422,515,7392,245,9662,013,4151,973,7051...
gross         $28.34M$534.86M$292.58M$37.03M$330.25M$107.93M...
dtype: object
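
If only the numeric totals are of interest, `numeric_only=True` restricts the reduction to numeric columns. A small sketch on a hypothetical toy frame (the column names and values are made up for illustration), showing both axis directions:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],      # string column, skipped below
    "rating": [9.0, 8.5, 8.0],
    "rank": [1, 2, 3],
})

col_sums = toy.sum(numeric_only=True)           # one value per numeric column
row_sums = toy.sum(axis=1, numeric_only=True)   # one value per row

print(col_sums)
print(row_sums)
```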

In [14]:
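# numeric_only=True skips the string columns; newer pandas versions no longer drop them silently
df.sum(axis=1, numeric_only=True)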

0      9.3
1     10.1
2     10.8
3     11.8
4     12.8
5     13.9
6     14.7
7     15.9
8     17.0
9     18.2
10    18.4
11    19.8
12    20.6
13    21.5
14    22.5
15    23.3
16    24.3
17    25.6
18    26.1
19    27.6
20    28.6
21    30.0
22    30.5
23    31.5
24    32.6
25    33.2
26    34.7
27    36.0
28    36.5
29    36.9
30    38.6
31    39.4
32    39.9
33    41.6
34    42.7
35    43.2
36    44.5
37    45.2
38    46.4
39    47.1
40    48.6
41    49.5
42    50.2
43    50.9
44    52.5
45    53.3
46    54.4
47    55.2
48    56.3
49    57.3
dtype: float64

In [15]:
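# numeric_only=True skips the string columns; newer pandas versions no longer drop them silently
df.mean(numeric_only=True)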

Unnamed: 0    24.50
rating         8.53
dtype: float64

In [24]:
# row-wise mean over the numeric columns only
df.mean(axis=1, numeric_only=True)

0      4.65
1      5.05
2      5.40
3      5.90
4      6.40
5      6.95
6      7.35
7      7.95
8      8.50
9      9.10
10     9.20
11     9.90
12    10.30
13    13.00
14    11.25
15    11.65
16    12.15
17    12.80
18    13.05
19    13.80
20    20.00
21    15.00
22    15.25
23    15.75
24    16.30
25    16.60
26    17.35
27    18.00
28    18.25
29    18.45
30    19.30
31    19.70
32    19.95
33    20.80
34    21.35
35    21.60
36    22.25
37    22.60
38    23.20
39    23.55
40    24.30
41    24.75
42    25.10
43    25.45
44    26.25
45    26.65
46    27.20
47    27.60
48    28.15
49    28.65
dtype: float64

In [25]:
df.describe()

Unnamed: 0.1,Unnamed: 0,rating
count,50.0,48.0
mean,24.5,8.529167
std,14.57738,0.331315
min,0.0,7.9
25%,12.25,8.3
50%,24.5,8.5
75%,36.75,8.725
max,49.0,9.3


In [26]:
df['genres'].describe()

count                                         48
unique                                        24
top       Action| Adventure| Fantasy            
freq                                           5
Name: genres, dtype: object

## 4. Looking for missing data

- pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays.

- The built-in Python None value is also treated as NA in object arrays:

![image.png](attachment:image.png)
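
A minimal sketch of the two cases described above, using two small hypothetical Series: `NaN` in a float Series and `None` in an object Series are both reported as missing by `isna()`.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, np.nan, 3.0])                       # float64 Series with NaN
strings = pd.Series(["Drama", None, "Crime"], dtype=object)  # object Series with None

print(floats.isna().tolist())   # [False, True, False]
print(strings.isna().tolist())  # [False, True, False]
```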

In [28]:
df.isnull()

Unnamed: 0    0
title         0
genres        2
runtime       0
rating        2
votes         1
gross         1
dtype: int64

In [29]:
df.isnull().sum()

Unnamed: 0    0
title         0
genres        2
runtime       0
rating        2
votes         1
gross         1
dtype: int64

In [35]:
df.isna().sum()

Unnamed: 0    0
title         0
genres        2
runtime       0
rating        2
votes         1
gross         1
dtype: int64
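
For reference, `isnull()` on its own returns a boolean DataFrame with the same shape as the input; the per-column counts come from summing that mask, and `isna()` is an alias of `isnull()`. A short sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "genres": ["Drama", None, "Crime"],
    "rating": [9.3, 9.2, np.nan],
})

mask = toy.isnull()             # boolean DataFrame, True where a value is missing
per_column = mask.sum()         # True counts as 1, so this counts missing values per column
per_row = mask.sum(axis=1)      # missing values in each row
same = toy.isna().equals(mask)  # isna() and isnull() give identical results

print(mask)
print(per_column)
print(per_row)
```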

## 5. Handling Missing Values (Numerical)

### 5.1 Dropping missing values

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

In [30]:
# dropna() drops every row that contains at least one missing value
newdf = df.dropna()

In [33]:
newdf.shape

(44, 7)

In [38]:
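# Indexing with a boolean DataFrame masks values instead of dropping rows,
# so the shape is unchanged and missing entries simply stay NaN.
filtered_df = df[df.notnull()]
filtered_df.shape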

(50, 7)
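
To actually remove rows with `notnull()`, the mask has to be a boolean Series built from one column rather than from the whole DataFrame. A sketch of the difference on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": [9.3, np.nan, 8.8],
})

masked = toy[toy.notnull()]               # DataFrame mask: same shape, NaN stays NaN
filtered = toy[toy["rating"].notnull()]   # Series mask: rows with a missing rating are removed

print(masked.shape)    # (3, 2)
print(filtered.shape)  # (2, 2)
```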

In [39]:
newdf1 = df.dropna(axis = 1)
newdf1.head()

Unnamed: 0.1,Unnamed: 0,title,runtime
0,0,The Shawshank Redemption,142 min
1,1,The Dark Knight,152 min
2,2,Inception,148 min
3,3,Fight Club,139 min
4,4,Forrest Gump,142 min
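
The `how`, `thresh` and `subset` parameters from the signature above control which rows count as "missing enough" to drop. A small sketch of how they differ, on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Drama", None, None, "Crime"],
    "rating": [9.3, 9.2, np.nan, np.nan],
})

any_missing = toy.dropna()                    # drop rows with at least one missing value
all_missing = toy.dropna(how="all")           # drop rows only if every value is missing
at_least_two = toy.dropna(thresh=2)           # keep rows with at least 2 non-missing values
rating_only = toy.dropna(subset=["rating"])   # only the rating column decides

print(len(any_missing), len(all_missing), len(at_least_two), len(rating_only))  # 1 4 3 2
```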


### 5.2 Filling missing values

Rather than filtering out missing data (and potentially discarding other data along with
it), you may want to fill in the “holes” in any number of ways. For most purposes, the
fillna method is the workhorse function to use. Calling fillna with a constant replaces
missing values with that value:

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

In [42]:
# fillna() with a constant: every missing value is replaced by 0
df.fillna(0)

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
0,0,The Shawshank Redemption,Drama,142 min,9.3,2557442,$28.34M
1,1,The Dark Knight,Action| Crime| Drama,152 min,9.1,2515739,$534.86M
2,2,Inception,Action| Adventure| Sci-Fi,148 min,8.8,2245966,$292.58M
3,3,Fight Club,Drama,139 min,8.8,2013415,$37.03M
4,4,Forrest Gump,Drama| Romance,142 min,8.8,1973705,$330.25M
5,5,Pulp Fiction,Crime| Drama,154 min,8.9,1965482,$107.93M
6,6,The Matrix,Action| Sci-Fi,136 min,8.7,1843346,$171.48M
7,7,The Lord of the Rings: The Fellowship of the Ring,Action| Adventure| Drama,178 min,8.9,1783703,$315.54M
8,8,The Lord of the Rings: The Return of the King,Action| Adventure| Drama,201 min,9.0,1761777,$377.85M
9,9,The Godfather,0,175 min,9.2,1760572,$134.97M


In [44]:
# Calling fillna() with a dict lets you use a different fill value for each column
df.fillna({"genres":'Comedy','rating':9.1,'votes':300000,'gross':'$200.00M'})[:5]

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
0,0,The Shawshank Redemption,Drama,142 min,9.3,2557442,$28.34M
1,1,The Dark Knight,Action| Crime| Drama,152 min,9.1,2515739,$534.86M
2,2,Inception,Action| Adventure| Sci-Fi,148 min,8.8,2245966,$292.58M
3,3,Fight Club,Drama,139 min,8.8,2013415,$37.03M
4,4,Forrest Gump,Drama| Romance,142 min,8.8,1973705,$330.25M


In [47]:
# bfill() fills each gap with the next valid value below it.
# fillna(method='bfill') is the older spelling and is deprecated in recent pandas versions.
df.bfill()[:5]

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
0,0,The Shawshank Redemption,Drama,142 min,9.3,2557442,$28.34M
1,1,The Dark Knight,Action| Crime| Drama,152 min,9.1,2515739,$534.86M
2,2,Inception,Action| Adventure| Sci-Fi,148 min,8.8,2245966,$292.58M
3,3,Fight Club,Drama,139 min,8.8,2013415,$37.03M
4,4,Forrest Gump,Drama| Romance,142 min,8.8,1973705,$330.25M


In [49]:
# ffill() fills each gap with the last valid value above it.
# fillna(method='ffill') is the older spelling and is deprecated in recent pandas versions.
df.ffill()[:5]

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
0,0,The Shawshank Redemption,Drama,142 min,9.3,2557442,$28.34M
1,1,The Dark Knight,Action| Crime| Drama,152 min,9.1,2515739,$534.86M
2,2,Inception,Action| Adventure| Sci-Fi,148 min,8.8,2245966,$292.58M
3,3,Fight Club,Drama,139 min,8.8,2013415,$37.03M
4,4,Forrest Gump,Drama| Romance,142 min,8.8,1973705,$330.25M
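
The first five rows contain no missing values, so the outputs above do not show the fill direction. A small sketch on a hypothetical Series makes the difference between forward and backward filling visible, including the `limit` argument:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([9.3, np.nan, np.nan, 8.8, np.nan])

forward = ratings.ffill()          # carry the last valid value forward
backward = ratings.bfill()         # pull the next valid value backward
limited = ratings.ffill(limit=1)   # fill at most one consecutive missing value

print(forward.tolist())   # [9.3, 9.3, 9.3, 8.8, 8.8]
print(backward.tolist())  # [9.3, 8.8, 8.8, 8.8, nan]
print(limited.tolist())   # [9.3, 9.3, nan, 8.8, 8.8]
```

Note that a trailing gap cannot be back-filled and a leading gap cannot be forward-filled, because there is no neighbouring value to copy.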


In [52]:
df.fillna({'genres':'Empty','rating':df['rating'].mean()})

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
0,0,The Shawshank Redemption,Drama,142 min,9.3,2557442.0,$28.34M
1,1,The Dark Knight,Action| Crime| Drama,152 min,9.1,2515739.0,$534.86M
2,2,Inception,Action| Adventure| Sci-Fi,148 min,8.8,2245966.0,$292.58M
3,3,Fight Club,Drama,139 min,8.8,2013415.0,$37.03M
4,4,Forrest Gump,Drama| Romance,142 min,8.8,1973705.0,$330.25M
5,5,Pulp Fiction,Crime| Drama,154 min,8.9,1965482.0,$107.93M
6,6,The Matrix,Action| Sci-Fi,136 min,8.7,1843346.0,$171.48M
7,7,The Lord of the Rings: The Fellowship of the Ring,Action| Adventure| Drama,178 min,8.9,1783703.0,$315.54M
8,8,The Lord of the Rings: The Return of the King,Action| Adventure| Drama,201 min,9.0,1761777.0,$377.85M
9,9,The Godfather,Empty,175 min,9.2,1760572.0,$134.97M


In [53]:
# interpolate() estimates missing numeric values from their neighbours (linear by default)
df.interpolate()

Unnamed: 0.1,Unnamed: 0,title,genres,runtime,rating,votes,gross
0,0,The Shawshank Redemption,Drama,142 min,9.3,2557442.0,$28.34M
1,1,The Dark Knight,Action| Crime| Drama,152 min,9.1,2515739.0,$534.86M
2,2,Inception,Action| Adventure| Sci-Fi,148 min,8.8,2245966.0,$292.58M
3,3,Fight Club,Drama,139 min,8.8,2013415.0,$37.03M
4,4,Forrest Gump,Drama| Romance,142 min,8.8,1973705.0,$330.25M
5,5,Pulp Fiction,Crime| Drama,154 min,8.9,1965482.0,$107.93M
6,6,The Matrix,Action| Sci-Fi,136 min,8.7,1843346.0,$171.48M
7,7,The Lord of the Rings: The Fellowship of the Ring,Action| Adventure| Drama,178 min,8.9,1783703.0,$315.54M
8,8,The Lord of the Rings: The Return of the King,Action| Adventure| Drama,201 min,9.0,1761777.0,$377.85M
9,9,The Godfather,,175 min,9.2,1760572.0,$134.97M
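
In the output above, the missing `genres` value in row 9 is still empty, since interpolation only applies to numeric data. A minimal sketch of what the default linear method does to a numeric Series, using hypothetical values:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([9.0, np.nan, 8.0, np.nan, np.nan, 6.5])

# Linear interpolation spaces the filled values evenly between the
# nearest valid neighbours on either side of each gap.
filled = ratings.interpolate()

print(filled.tolist())  # [9.0, 8.5, 8.0, 7.5, 7.0, 6.5]
```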


## 6. Handling Missing Values (Categorical)

### 6.1 Deleting values

### 6.2 Replacing with mode
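
One way this could look, sketched on a hypothetical `genres` Series rather than the actual dataset: compute the most frequent category with `mode()` and pass it to `fillna()`.

```python
import pandas as pd

genres = pd.Series(["Drama", "Crime", None, "Drama", None], dtype=object)

# mode() ignores missing values by default and returns a Series,
# because ties produce more than one result; take the first entry.
most_common = genres.mode().iloc[0]
filled = genres.fillna(most_common)

print(most_common)      # Drama
print(filled.tolist())  # ['Drama', 'Crime', 'Drama', 'Drama', 'Drama']
```

Filling with the mode inflates the share of the most common category, so it is best suited to columns with few missing values.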

### 6.3 Predicting values

## Homework/TODO - Data Inspection and Handling Missing Values

Inspect the Yelp dataset and clean its missing values using different techniques. Provide detailed documentation.