# `Dataframes`

    A DataFrame is the two-dimensional data structure in pandas. It can be thought of as a collection of one or
    more Series that share a common index.
    
    We can create a dataframe either with the pd.DataFrame() constructor or by reading a file with the
    pd.read_csv() function. The latter is more common in practice.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame([
    [100, 90, 9],
    [90, 80, 8],
    [110, 95, 15]
], columns = ['iq', 'marks', 'package'])

df

,iq,marks,package
0,100,90,9
1,90,80,8
2,110,95,15


#### `We may also pass in a dictionary of columns.`

In [3]:
df = pd.DataFrame({
    'iq': [100, 90, 110],
    'marks': [90, 80, 95],
    'package': [9, 8, 15] 
})

df

,iq,marks,package
0,100,90,9
1,90,80,8
2,110,95,15
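
    As a small sketch of the two constructions above, the snippet below builds the same data from a list of rows
    and from a dictionary of columns, then checks that the results match and that a single column is a Series.
    The variable names (from_rows, from_dict, same, iq) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same data as above, built once from rows and once from a dict of columns.
from_rows = pd.DataFrame(
    [[100, 90, 9], [90, 80, 8], [110, 95, 15]],
    columns=['iq', 'marks', 'package'],
)
from_dict = pd.DataFrame({
    'iq': [100, 90, 110],
    'marks': [90, 80, 95],
    'package': [9, 8, 15],
})

# equals() compares shape, labels, dtypes and values.
same = from_rows.equals(from_dict)

# Selecting one column by name gives back a Series.
iq = from_dict['iq']
print(same, type(iq).__name__)
```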


## `Using the read_csv() function`

In [4]:
movies = pd.read_csv('movies.csv')
movies.head()

,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)
2,The Accidental Prime Minister (film),tt6986710,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/The_Accidental_P...,The Accidental Prime Minister,The Accidental Prime Minister,0,2019,112,Biography|Drama,6.1,5549,Based on the memoir by Indian policy analyst S...,Explores Manmohan Singh's tenure as the Prime ...,,Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S...,,11 January 2019 (USA)
3,Why Cheat India,tt8108208,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Why_Cheat_India,Why Cheat India,Why Cheat India,0,2019,121,Crime|Drama,6.0,1891,The movie focuses on existing malpractices in ...,The movie focuses on existing malpractices in ...,,Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ...,,18 January 2019 (USA)
4,Evening Shadows,tt6028796,,https://en.wikipedia.org/wiki/Evening_Shadows,Evening Shadows,Evening Shadows,0,2018,102,Drama,7.3,280,While gay rights and marriage equality has bee...,Under the 'Evening Shadows' truth often plays...,,Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...,17 wins & 1 nomination,11 January 2019 (India)


In [5]:
batsman_runs = pd.read_csv('batsman_runs_ipl.csv', index_col = 0)
batsman_runs.index.name = None
batsman_runs

,batsman_run
A Ashish Reddy,280
A Badoni,161
A Chandila,4
A Chopra,53
A Choudhary,25
...,...
Yash Dayal,0
Yashpal Singh,47
Younis Khan,3
Yuvraj Singh,2754
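
    To see what index_col = 0 does without needing a file on disk, here is a minimal sketch that reads CSV text
    from an in-memory buffer. The CSV text and the column name 'batter' are hypothetical stand-ins; only the
    three rows of values are taken from the output above.

```python
import io
import pandas as pd

# Hypothetical CSV content; 'batter' is an assumed header name.
csv_text = """batter,batsman_run
A Ashish Reddy,280
A Badoni,161
A Chandila,4
"""

# Default: every CSV column becomes a dataframe column and rows get a RangeIndex.
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first CSV column becomes the row labels instead.
named_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())
print(named_index.index.tolist())
```

    With the first column as the index, rows can be looked up by label, for example
    named_index.loc['A Badoni', 'batsman_run'].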


In [6]:
diabetes = pd.read_csv('diabetes.csv')
diabetes

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


## `Dataframe attributes`

    At the end of the day, dataframes are Python objects, so they have attributes and methods attached to them.
    An attribute holds information about the object and is accessed without parentheses. A method is a function
    bound to an instance of a class and has to be called.

### `shape`

    This returns the shape of the dataframe as a tuple of two elements. The first element is the number of rows
    and the second element is the number of columns.

In [7]:
movies.shape

(1629, 18)
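
    Because shape is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up dataframe
    (not the movies data) and also shows the related len() and size values.

```python
import pandas as pd

# Small illustrative dataframe: 3 rows, 2 columns.
students = pd.DataFrame({'phy': [92, 33, 95], 'chem': [90, 45, 95]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = students.shape

# len() gives the number of rows; size gives the total number of cells.
row_count = len(students)
cell_count = students.size

print(n_rows, n_cols, row_count, cell_count)
```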

### `dtypes`

    This attribute tells us about the data type of each individual column in our dataframe.

In [8]:
movies.dtypes

title_x              object
imdb_id              object
poster_path          object
wiki_link            object
title_y              object
original_title       object
is_adult              int64
year_of_release       int64
runtime              object
genres               object
imdb_rating         float64
imdb_votes            int64
story                object
summary              object
tagline              object
actors               object
wins_nominations     object
release_date         object
dtype: object
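
    Note that runtime is reported as object above even though it looks numeric. A likely cause (an assumption
    here, although the \N value that describe() reports for this column later in this document points the same
    way) is that the column contains non-numeric placeholder text. The sketch below uses a small hypothetical
    dataframe to show how pd.to_numeric() with errors='coerce' turns such a column into numbers, with the
    placeholders becoming NaN.

```python
import pandas as pd

# Hypothetical data: numbers stored as text, plus a '\N' placeholder.
films = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'runtime': ['138', r'\N', '131'],
})
print(films.dtypes)

# Values that cannot be parsed as numbers become NaN instead of raising an error.
films['runtime'] = pd.to_numeric(films['runtime'], errors='coerce')
print(films.dtypes)
```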

### `index`

    This retrieves the Index object that labels the rows of the dataframe.

In [9]:
movies.index

RangeIndex(start=0, stop=1629, step=1)

### `columns`
    
    This returns the names of all the columns present in the dataframe.

In [10]:
movies.columns

Index(['title_x', 'imdb_id', 'poster_path', 'wiki_link', 'title_y',
       'original_title', 'is_adult', 'year_of_release', 'runtime', 'genres',
       'imdb_rating', 'imdb_votes', 'story', 'summary', 'tagline', 'actors',
       'wins_nominations', 'release_date'],
      dtype='object')
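
    Both index and columns are Index objects, so they can be converted to lists or used for membership checks.
    A minimal sketch on a small made-up dataframe:

```python
import pandas as pd

students = pd.DataFrame(
    [[92, 90, 99], [33, 45, 20]],
    index=['Abhishek', 'Amrusha'],
    columns=['phy', 'chem', 'math'],
)

# Row labels and column labels as plain Python lists.
row_labels = students.index.tolist()
col_labels = students.columns.tolist()

# Membership check against the column labels.
has_math = 'math' in students.columns
has_bio = 'bio' in students.columns

print(row_labels, col_labels, has_math, has_bio)
```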

## `Dataframe methods`

### `head()`

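    Returns the top rows of the dataframe, 5 by default. We may specify the number of rows to be retrieved.
    The result is a new DataFrame object. Whether it shares memory with the original depends on the pandas
    version and settings, so do not rely on it behaving as either a view or a copy.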

In [11]:
movies.head(2)

,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)
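
    A short sketch of how the argument to head() behaves, using a made-up ten-row dataframe. A negative value
    is also accepted and returns everything except the last n rows.

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(10)})

first_five = numbers.head()            # default: first 5 rows
first_two = numbers.head(2)            # first 2 rows
all_but_last_three = numbers.head(-3)  # every row except the last 3

print(len(first_five), len(first_two), len(all_but_last_three))
```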


### `tail()`

    Returns the bottom rows of the dataframe, 5 by default.

In [12]:
movies.tail(2)

,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
1627,Daaka,tt10833860,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Daaka,Daaka,Daaka,0,2019,136,Action,7.4,38,Shinda tries robbing a bank so he can be wealt...,Shinda tries robbing a bank so he can be wealt...,,Gippy Grewal|Zareen Khan|,,1 November 2019 (USA)
1628,Humsafar,tt2403201,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Humsafar,Humsafar,Humsafar,0,2011,35,Drama|Romance,9.0,2968,Sara and Ashar are childhood friends who share...,Ashar and Khirad are forced to get married due...,,Fawad Khan|,,TV Series (2011–2012)


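### `sample()`

    Returns randomly chosen rows of the dataframe, 1 by default, so the output changes from run to run.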

In [13]:
movies.sample(2)

,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
1507,Badhaai Ho Badhaai,tt0325041,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Badhaai_Ho_Badhaai,Badhaai Ho Badhaai,Badhaai Ho Badhaai,0,2002,155,Comedy|Drama|Musical,4.4,585,The D'Souza and the Chaddha families are neigh...,The D'Souza and the Chaddha families are neigh...,,Anil Kapoor|,1 nomination,14 June 2002 (USA)
63,Made in China (2019 film),tt8983180,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Made_In_China_(2...,Made in China,Made in China,0,2019,128,Comedy|Drama,6.3,596,Story of Failed Gujarati Businessman who jump...,Story of Failed Gujarati Businessman who jump...,,Rajkummar Rao|Mouni Roy|Boman Irani|Paresh Raw...,,25 October 2019 (USA)
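
    Since sample() picks rows at random, two calls normally give different rows. If a repeatable draw is needed,
    the random_state parameter can be fixed, as in this sketch on a made-up ten-row dataframe (tail() is included
    for comparison).

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(10)})

# Last two rows, in their original order.
last_two = numbers.tail(2)

# Same seed, same rows: the two draws are identical.
draw_a = numbers.sample(3, random_state=0)
draw_b = numbers.sample(3, random_state=0)

print(last_two['n'].tolist())
print(draw_a.equals(draw_b))
```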


### `info()`

    This method prints a summary of the dataframe: the index range, and for each column the number of
    non-missing values along with its data type.

In [14]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title_x           1629 non-null   object 
 1   imdb_id           1629 non-null   object 
 2   poster_path       1526 non-null   object 
 3   wiki_link         1629 non-null   object 
 4   title_y           1629 non-null   object 
 5   original_title    1629 non-null   object 
 6   is_adult          1629 non-null   int64  
 7   year_of_release   1629 non-null   int64  
 8   runtime           1629 non-null   object 
 9   genres            1629 non-null   object 
 10  imdb_rating       1629 non-null   float64
 11  imdb_votes        1629 non-null   int64  
 12  story             1609 non-null   object 
 13  summary           1629 non-null   object 
 14  tagline           557 non-null    object 
 15  actors            1624 non-null   object 
 16  wins_nominations  707 non-null    object 
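
    One thing worth knowing is that info() prints its report and returns None, so the result cannot be assigned
    and reused. As a sketch, the report can be captured through the buf parameter, and the non-null counts alone
    are available from count(). The dataframe below is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2, None], 'b': ['x', None, None]})

# info() writes to the given buffer and returns None.
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()

# count() returns the non-null count per column as a Series.
non_null = df.count()

print(result)
print(non_null)
```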


### `describe()`

    This method gives a statistical summary of the numerical columns of the dataframe.
    To get a summary of every column, including the non-numerical ones, pass include = 'all'.

In [15]:
movies.describe()

,is_adult,year_of_release,imdb_rating,imdb_votes
count,1629.0,1629.0,1629.0,1629.0
mean,0.0,2010.263966,5.557459,5384.263352
std,0.0,5.381542,1.567609,14552.103231
min,0.0,2001.0,0.0,0.0
25%,0.0,2005.0,4.4,233.0
50%,0.0,2011.0,5.6,1000.0
75%,0.0,2015.0,6.8,4287.0
max,0.0,2019.0,9.4,310481.0


In [16]:
movies.describe(include='all')

,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
count,1629,1629,1526,1629,1629,1629,1629.0,1629.0,1629,1629,1629.0,1629.0,1609,1629,557,1624,707,1522
unique,1625,1623,1517,1629,1620,1621,,,130,205,,,1603,1604,553,1617,229,1063
top,Tanu Weds Manu: Returns,tt2140465,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Style,Tanu Weds Manu Returns,,,\N,Drama,,,Tanu and Manu's marriage collapses. What happe...,Add a Plot »,Once upon a time in India,Neil Nitin Mukesh|,1 nomination,14 October 2016 (India)
freq,2,2,4,1,3,2,,,119,162,,,2,20,2,2,115,5
mean,,,,,,,0.0,2010.263966,,,5.557459,5384.263352,,,,,,
std,,,,,,,0.0,5.381542,,,1.567609,14552.103231,,,,,,
min,,,,,,,0.0,2001.0,,,0.0,0.0,,,,,,
25%,,,,,,,0.0,2005.0,,,4.4,233.0,,,,,,
50%,,,,,,,0.0,2011.0,,,5.6,1000.0,,,,,,
75%,,,,,,,0.0,2015.0,,,6.8,4287.0,,,,,,
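
    The difference between the two calls is easier to see on a tiny dataframe. In this sketch (with made-up
    data), the default call keeps only the numerical column, while include = 'all' adds the unique, top and freq
    rows for the text column.

```python
import pandas as pd

df = pd.DataFrame({
    'marks': [90, 80, 95, 80],
    'grade': ['A', 'B', 'A', 'B'],
})

# Default: numerical columns only.
numeric_summary = df.describe()

# include='all': every column, with NaN where a statistic does not apply.
full_summary = df.describe(include='all')

print(numeric_summary.columns.tolist())
print(full_summary.index.tolist())
```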


### `isnull()`

    This checks every cell of the dataframe and returns a boolean dataframe of the same shape, where True marks
    a missing value. By missing value, we mean NaN (or None). Any other value is called a non-missing value.

In [17]:
movies.isnull()

,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
4,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1624,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
1625,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
1626,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,True,True
1627,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False


    We may then use the sum() method on the obtained boolean dataframe to get the total number of missing values
    per column.

In [18]:
movies.isnull().sum()

title_x                0
imdb_id                0
poster_path          103
wiki_link              0
title_y                0
original_title         0
is_adult               0
year_of_release        0
runtime                0
genres                 0
imdb_rating            0
imdb_votes             0
story                 20
summary                0
tagline             1072
actors                 5
wins_nominations     922
release_date         107
dtype: int64

    We can now call the sum() method again on the series obtained above to get the total number of missing
    values in the entire dataframe.

In [19]:
movies.isnull().sum().sum()

2229
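
    The same chain of calls can be followed by hand on a small made-up dataframe. The sketch also counts missing
    values per row with axis = 1, which is an addition not shown in the original cells.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', None, None],
})

per_column = df.isnull().sum()       # missing values in each column
total = df.isnull().sum().sum()      # missing values in the whole dataframe
per_row = df.isnull().sum(axis=1)    # missing values in each row

# isna() is an alias of isnull() and gives the same result.
same_as_isna = df.isna().equals(df.isnull())

print(per_column.to_dict(), int(total), per_row.tolist())
```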

### `duplicated()`

    The duplicated() method returns a boolean series that tells whether each row repeats an earlier row.
    Two rows are considered equal when every entry in one equals the corresponding entry in the other.

In [20]:
movies.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1624    False
1625    False
1626    False
1627    False
1628    False
Length: 1629, dtype: bool

    We can then use the .sum() method to give the total number of duplicate rows inside the dataframe.

In [21]:
movies.duplicated().sum()

0
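
    Since the movies data has no duplicate rows, the behaviour is easier to see on a small made-up dataframe.
    The sketch below also uses the subset and keep parameters and the related drop_duplicates() method, which
    are not covered in the original cells.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'A', 'A'],
    'marks': [90, 80, 90, 70],
})

# Default: a row is flagged only if an identical row appeared earlier.
full_dupes = df.duplicated()

# Compare on the 'name' column only.
name_dupes = df.duplicated(subset='name')

# keep=False flags every member of a duplicate group, including the first.
all_copies = df.duplicated(keep=False)

# Drop the rows flagged by the default duplicated() call.
deduped = df.drop_duplicates()

print(full_dupes.tolist(), name_dupes.tolist(), all_copies.tolist())
```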

## `Math Methods`

In [22]:
list_ = [
    [92, 90, 99],
    [33, 45, 20],
    [95, 95, 97]
]
students = pd.DataFrame(list_, 
                        index = ['Abhishek', 'Amrusha', 'Priyanka'], 
                        columns = ['phy', 'chem', 'math'])

students

,phy,chem,math
Abhishek,92,90,99
Amrusha,33,45,20
Priyanka,95,95,97


In [23]:
students.sum()

phy     220
chem    230
math    216
dtype: int64

    To perform the row-wise sum, we can pass axis = 1.

In [24]:
students.sum(axis = 1)

Abhishek    281
Amrusha      98
Priyanka    287
dtype: int64
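
    The axis argument works the same way for the other math methods. As a sketch on the same students data,
    mean() averages down each column by default and across each row with axis = 1, and idxmax(axis = 1) gives
    the best subject for each student. The use of mean() and idxmax() here is an addition for illustration.

```python
import pandas as pd

students = pd.DataFrame(
    [[92, 90, 99], [33, 45, 20], [95, 95, 97]],
    index=['Abhishek', 'Amrusha', 'Priyanka'],
    columns=['phy', 'chem', 'math'],
)

# Default (axis=0): one value per column.
col_mean = students.mean()

# axis=1: one value per row.
row_mean = students.mean(axis=1)

# Column label of the highest mark in each row.
top_subject = students.idxmax(axis=1)

print(col_mean['math'])
print(top_subject.tolist())
```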