### SESSION 17 - PANDAS DATAFRAMES

#### What is a Pandas DataFrame?
- A Pandas DataFrame is a two-dimensional data structure, similar to a 2D array or a table with rows and columns.
- It is used for data manipulation, analysis, and cleaning. It has labeled rows and columns, can hold a different data type in each column, and supports adding, removing, and transforming data.
- DataFrames are widely used in data science and analysis for handling structured data efficiently.
- **Syntax: pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)**
    - **data:** ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    - **index:** Index or array-like
    - **columns:** Index or array-like
    - **dtype:** dtype, default None
    - **copy:** bool or None, default None
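
The constructor parameters listed above can be tried on a small frame. This is a minimal sketch; the row labels `s1` to `s3` and the sample values are made up for illustration and are not part of the session's datasets.

```python
import pandas as pd

# Hypothetical sample data: one inner list per row.
rows = [[100, 95], [107, 87], [89, 78]]

df = pd.DataFrame(
    data=rows,
    index=['s1', 's2', 's3'],   # row labels (invented for this sketch)
    columns=['iq', 'marks'],    # column labels
    dtype='float64',            # force every column to float64
)
print(df)
print(df.dtypes)
```

Leaving out `index` gives the default 0, 1, 2, ... row labels, and leaving out `dtype` lets pandas infer a type per column.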

#### Creating DataFrames:
**Using lists :**

In [90]:
import pandas as pd
import numpy as np

student_data = [
    [100,95,14],
    [107,87,16],
    [89,78,12]
]
pd.DataFrame(student_data, columns=['iq','marks','package'])

    iq  marks  package
0  100     95       14
1  107     87       16
2   89     78       12


**Using dictionary:**

In [91]:
dictionary = {
    'iq':[100,95,14],
    'marks':[107,87,16],
    'package':[89,78,12],
}
students = pd.DataFrame(dictionary)
print(students)

    iq  marks  package
0  100    107       89
1   95     87       78
2   14     16       12


**Using read_csv() function:**

In [92]:
movies = pd.read_csv('DATASETS/S17/movies.csv')
movies

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)
2,The Accidental Prime Minister (film),tt6986710,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/The_Accidental_P...,The Accidental Prime Minister,The Accidental Prime Minister,0,2019,112,Biography|Drama,6.1,5549,Based on the memoir by Indian policy analyst S...,Explores Manmohan Singh's tenure as the Prime ...,,Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S...,,11 January 2019 (USA)
3,Why Cheat India,tt8108208,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Why_Cheat_India,Why Cheat India,Why Cheat India,0,2019,121,Crime|Drama,6.0,1891,The movie focuses on existing malpractices in ...,The movie focuses on existing malpractices in ...,,Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ...,,18 January 2019 (USA)
4,Evening Shadows,tt6028796,,https://en.wikipedia.org/wiki/Evening_Shadows,Evening Shadows,Evening Shadows,0,2018,102,Drama,7.3,280,While gay rights and marriage equality has bee...,Under the 'Evening Shadows' truth often plays...,,Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...,17 wins & 1 nomination,11 January 2019 (India)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1624,Tera Mera Saath Rahen,tt0301250,https://upload.wikimedia.org/wikipedia/en/2/2b...,https://en.wikipedia.org/wiki/Tera_Mera_Saath_...,Tera Mera Saath Rahen,Tera Mera Saath Rahen,0,2001,148,Drama,4.9,278,Raj Dixit lives with his younger brother Rahu...,A man is torn between his handicapped brother ...,,Ajay Devgn|Sonali Bendre|Namrata Shirodkar|Pre...,,7 November 2001 (India)
1625,Yeh Zindagi Ka Safar,tt0298607,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S...,Yeh Zindagi Ka Safar,Yeh Zindagi Ka Safar,0,2001,146,Drama,3.0,133,Hindi pop-star Sarina Devan lives a wealthy ...,A singer finds out she was adopted when the ed...,,Ameesha Patel|Jimmy Sheirgill|Nafisa Ali|Gulsh...,,16 November 2001 (India)
1626,Sabse Bada Sukh,tt0069204,,https://en.wikipedia.org/wiki/Sabse_Bada_Sukh,Sabse Bada Sukh,Sabse Bada Sukh,0,2018,\N,Comedy|Drama,6.1,13,Village born Lalloo re-locates to Bombay and ...,Village born Lalloo re-locates to Bombay and ...,,Vijay Arora|Asrani|Rajni Bala|Kumud Damle|Utpa...,,
1627,Daaka,tt10833860,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Daaka,Daaka,Daaka,0,2019,136,Action,7.4,38,Shinda tries robbing a bank so he can be wealt...,Shinda tries robbing a bank so he can be wealt...,,Gippy Grewal|Zareen Khan|,,1 November 2019 (USA)


In [93]:
ipl = pd.read_csv('DATASETS/S17/ipl-matches.csv')
ipl

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
0,1312200,Ahmedabad,2022-05-29,2022,Final,Rajasthan Royals,Gujarat Titans,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,bat,N,Gujarat Titans,Wickets,7.0,,HH Pandya,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",CB Gaffaney,Nitin Menon
1,1312199,Ahmedabad,2022-05-27,2022,Qualifier 2,Royal Challengers Bangalore,Rajasthan Royals,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,field,N,Rajasthan Royals,Wickets,7.0,,JC Buttler,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...",CB Gaffaney,Nitin Menon
2,1312198,Kolkata,2022-05-25,2022,Eliminator,Royal Challengers Bangalore,Lucknow Super Giants,"Eden Gardens, Kolkata",Lucknow Super Giants,field,N,Royal Challengers Bangalore,Runs,14.0,,RM Patidar,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['Q de Kock', 'KL Rahul', 'M Vohra', 'DJ Hooda...",J Madanagopal,MA Gough
3,1312197,Kolkata,2022-05-24,2022,Qualifier 1,Rajasthan Royals,Gujarat Titans,"Eden Gardens, Kolkata",Gujarat Titans,field,N,Gujarat Titans,Wickets,7.0,,DA Miller,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",BNJ Oxenford,VK Sharma
4,1304116,Mumbai,2022-05-22,2022,70,Sunrisers Hyderabad,Punjab Kings,"Wankhede Stadium, Mumbai",Sunrisers Hyderabad,bat,N,Punjab Kings,Wickets,5.0,,Harpreet Brar,"['PK Garg', 'Abhishek Sharma', 'RA Tripathi', ...","['JM Bairstow', 'S Dhawan', 'M Shahrukh Khan',...",AK Chaudhary,NA Patwardhan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945,335986,Kolkata,2008-04-20,2007/08,4,Kolkata Knight Riders,Deccan Chargers,Eden Gardens,Deccan Chargers,bat,N,Kolkata Knight Riders,Wickets,5.0,,DJ Hussey,"['WP Saha', 'BB McCullum', 'RT Ponting', 'SC G...","['AC Gilchrist', 'Y Venugopal Rao', 'VVS Laxma...",BF Bowden,K Hariharan
946,335985,Mumbai,2008-04-20,2007/08,5,Mumbai Indians,Royal Challengers Bangalore,Wankhede Stadium,Mumbai Indians,bat,N,Royal Challengers Bangalore,Wickets,5.0,,MV Boucher,"['L Ronchi', 'ST Jayasuriya', 'DJ Thornely', '...","['S Chanderpaul', 'R Dravid', 'LRPL Taylor', '...",SJ Davis,DJ Harper
947,335984,Delhi,2008-04-19,2007/08,3,Delhi Daredevils,Rajasthan Royals,Feroz Shah Kotla,Rajasthan Royals,bat,N,Delhi Daredevils,Wickets,9.0,,MF Maharoof,"['G Gambhir', 'V Sehwag', 'S Dhawan', 'MK Tiwa...","['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif'...",Aleem Dar,GA Pratapkumar
948,335983,Chandigarh,2008-04-19,2007/08,2,Kings XI Punjab,Chennai Super Kings,"Punjab Cricket Association Stadium, Mohali",Chennai Super Kings,bat,N,Chennai Super Kings,Runs,33.0,,MEK Hussey,"['K Goel', 'JR Hopes', 'KC Sangakkara', 'Yuvra...","['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dh...",MR Benson,SL Shastri


#### Pandas DataFrame Attributes:


**shape attribute:**
- This attribute returns the number of rows and columns of a DataFrame as a tuple.
- For example, a DataFrame with 3 rows and 2 columns has the shape (3, 2).
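
The (3, 2) case can be checked directly. A minimal sketch, assuming a made-up frame named `small` with columns `a` and `b`:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is a plain tuple, so it can be unpacked into rows and columns.
n_rows, n_cols = small.shape
print(small.shape)      # (3, 2)
print(n_rows, n_cols)   # 3 2
```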

In [94]:
# shape
# return tuple with dimensions of dataframe
print('Shape of the DataFrame:',movies.shape)
print('Shape of the DataFrame:',ipl.shape)

Shape of the DataFrame: (1629, 18)
Shape of the DataFrame: (950, 20)


**dtypes attribute:**
- This attribute shows the data type of each column of a DataFrame.

In [95]:
#dtypes
# check/print all columns dtypes
print(movies.dtypes)
print(ipl.dtypes)

title_x              object
imdb_id              object
poster_path          object
wiki_link            object
title_y              object
original_title       object
is_adult              int64
year_of_release       int64
runtime              object
genres               object
imdb_rating         float64
imdb_votes            int64
story                object
summary              object
tagline              object
actors               object
wins_nominations     object
release_date         object
dtype: object
ID                   int64
City                object
Date                object
Season              object
MatchNumber         object
Team1               object
Team2               object
Venue               object
TossWinner          object
TossDecision        object
SuperOver           object
WinningTeam         object
WonBy               object
Margin             float64
method              object
Player_of_Match     object
Team1Players        object
Team2Players        object

**index attribute:**
- The index attribute returns the row labels of a DataFrame.
- We can also assign our own index labels instead of the default range.

In [96]:
# index
print(movies.index)
print(ipl.index)

RangeIndex(start=0, stop=1629, step=1)
RangeIndex(start=0, stop=950, step=1)
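
Custom row labels, mentioned above, can be set either at construction or afterwards. This is a sketch on a made-up frame (`students_demo` is a hypothetical name, not a variable from the session):

```python
import pandas as pd

# Custom labels supplied at construction time.
students_demo = pd.DataFrame(
    {'iq': [100, 95, 14], 'marks': [107, 87, 16]},
    index=['a', 'b', 'c'],
)
print(students_demo.index)

# Labels can also be replaced later; the new labels must match the row count.
students_demo.index = range(1, 4)
print(students_demo.index)
```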


**columns attribute:**
- This attribute returns the column labels of a DataFrame.

In [97]:
# columns
# The column labels of the DataFrame.
print(movies.columns)
print(ipl.columns)
print(students.columns)

Index(['title_x', 'imdb_id', 'poster_path', 'wiki_link', 'title_y',
       'original_title', 'is_adult', 'year_of_release', 'runtime', 'genres',
       'imdb_rating', 'imdb_votes', 'story', 'summary', 'tagline', 'actors',
       'wins_nominations', 'release_date'],
      dtype='object')
Index(['ID', 'City', 'Date', 'Season', 'MatchNumber', 'Team1', 'Team2',
       'Venue', 'TossWinner', 'TossDecision', 'SuperOver', 'WinningTeam',
       'WonBy', 'Margin', 'method', 'Player_of_Match', 'Team1Players',
       'Team2Players', 'Umpire1', 'Umpire2'],
      dtype='object')
Index(['iq', 'marks', 'package'], dtype='object')


**values attribute:**
- This attribute returns the data of the DataFrame as a NumPy array.

In [98]:
# values
print(ipl.values)
print(movies.values)
print(students.values)

[[1312200 'Ahmedabad' '2022-05-29' ...
  "['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pandya', 'DA Miller', 'R Tewatia', 'Rashid Khan', 'R Sai Kishore', 'LH Ferguson', 'Yash Dayal', 'Mohammed Shami']"
  'CB Gaffaney' 'Nitin Menon']
 [1312199 'Ahmedabad' '2022-05-27' ...
  "['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D Padikkal', 'SO Hetmyer', 'R Parag', 'R Ashwin', 'TA Boult', 'YS Chahal', 'M Prasidh Krishna', 'OC McCoy']"
  'CB Gaffaney' 'Nitin Menon']
 [1312198 'Kolkata' '2022-05-25' ...
  "['Q de Kock', 'KL Rahul', 'M Vohra', 'DJ Hooda', 'MP Stoinis', 'E Lewis', 'KH Pandya', 'PVD Chameera', 'Mohsin Khan', 'Avesh Khan', 'Ravi Bishnoi']"
  'J Madanagopal' 'MA Gough']
 ...
 [335984 'Delhi' '2008-04-19' ...
  "['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif', 'DS Lehmann', 'RA Jadeja', 'M Rawat', 'D Salunkhe', 'SK Warne', 'SK Trivedi', 'MM Patel']"
  'Aleem Dar' 'GA Pratapkumar']
 [335983 'Chandigarh' '2008-04-19' ...
  "['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dhoni', 'SK Raina', 'JDP 

#### Pandas DataFrame Methods:

**head() method:**

In [99]:
print(movies.head()) # default is 5
print(movies.head(3))

                                title_x    imdb_id  \
0              Uri: The Surgical Strike  tt8291224   
1                         Battalion 609  tt9472208   
2  The Accidental Prime Minister (film)  tt6986710   
3                       Why Cheat India  tt8108208   
4                       Evening Shadows  tt6028796   

                                         poster_path  \
0  https://upload.wikimedia.org/wikipedia/en/thum...   
1                                                NaN   
2  https://upload.wikimedia.org/wikipedia/en/thum...   
3  https://upload.wikimedia.org/wikipedia/en/thum...   
4                                                NaN   

                                           wiki_link  \
0  https://en.wikipedia.org/wiki/Uri:_The_Surgica...   
1        https://en.wikipedia.org/wiki/Battalion_609   
2  https://en.wikipedia.org/wiki/The_Accidental_P...   
3      https://en.wikipedia.org/wiki/Why_Cheat_India   
4      https://en.wikipedia.org/wiki/Evening_Shadows   

 

**tail() method:**

In [100]:
print(movies.tail()) # default is 5
print(ipl.tail(3))

                    title_x     imdb_id  \
1624  Tera Mera Saath Rahen   tt0301250   
1625   Yeh Zindagi Ka Safar   tt0298607   
1626        Sabse Bada Sukh   tt0069204   
1627                  Daaka  tt10833860   
1628               Humsafar   tt2403201   

                                            poster_path  \
1624  https://upload.wikimedia.org/wikipedia/en/2/2b...   
1625  https://upload.wikimedia.org/wikipedia/en/thum...   
1626                                                NaN   
1627  https://upload.wikimedia.org/wikipedia/en/thum...   
1628  https://upload.wikimedia.org/wikipedia/en/thum...   

                                              wiki_link  \
1624  https://en.wikipedia.org/wiki/Tera_Mera_Saath_...   
1625  https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S...   
1626      https://en.wikipedia.org/wiki/Sabse_Bada_Sukh   
1627                https://en.wikipedia.org/wiki/Daaka   
1628             https://en.wikipedia.org/wiki/Humsafar   

                    title_y    

**sample() method:**
- sample() returns a random sample of rows (or of columns, with axis=1) from the DataFrame it is called on. By default it returns one row.

In [101]:
# sample()
# return random rows
print(ipl.sample())

         ID   City        Date Season MatchNumber             Team1  \
624  598001  Delhi  2013-04-06   2013           4  Delhi Daredevils   

                Team2             Venue        TossWinner TossDecision  \
624  Rajasthan Royals  Feroz Shah Kotla  Rajasthan Royals          bat   

    SuperOver       WinningTeam WonBy  Margin method Player_of_Match  \
624         N  Rajasthan Royals  Runs     5.0    NaN        R Dravid   

                                          Team1Players  \
624  ['DA Warner', 'UBT Chand', 'DPMD Jayawardene',...   

                                          Team2Players Umpire1        Umpire2  
624  ['MDKJ Perera', 'AM Rahane', 'R Dravid', 'STR ...   S Das  C Shamshuddin  
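
Beyond the single-row default, `sample()` accepts `n`, `frac`, `axis`, and `random_state`. The sketch below uses a small made-up frame (`demo`) rather than the IPL data, so it runs without the CSV file:

```python
import pandas as pd

# Hypothetical frame standing in for a real dataset.
demo = pd.DataFrame({
    'iq': [100, 95, 14, 0],
    'marks': [107, 87, 16, 0],
    'package': [89, 78, 12, 0],
})

two_rows = demo.sample(n=2, random_state=0)          # 2 random rows, reproducible
one_col = demo.sample(n=1, axis=1, random_state=0)   # 1 random column
half = demo.sample(frac=0.5, random_state=0)         # half of the rows

print(two_rows)
print(one_col)
print(half)
```

Fixing `random_state` makes the draw repeatable across runs, which helps when sharing notebooks.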


**info() method:**
- info() prints a concise summary of the DataFrame: the index dtype, each column with its dtype and non-null count, and the memory usage.
- Use it to get a quick overview of a dataset.

In [102]:
movies.info()
ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title_x           1629 non-null   object 
 1   imdb_id           1629 non-null   object 
 2   poster_path       1526 non-null   object 
 3   wiki_link         1629 non-null   object 
 4   title_y           1629 non-null   object 
 5   original_title    1629 non-null   object 
 6   is_adult          1629 non-null   int64  
 7   year_of_release   1629 non-null   int64  
 8   runtime           1629 non-null   object 
 9   genres            1629 non-null   object 
 10  imdb_rating       1629 non-null   float64
 11  imdb_votes        1629 non-null   int64  
 12  story             1609 non-null   object 
 13  summary           1629 non-null   object 
 14  tagline           557 non-null    object 
 15  actors            1624 non-null   object 
 16  wins_nominations  707 non-null    object 


**describe() method :**

In [103]:
movies.describe()

Unnamed: 0,is_adult,year_of_release,imdb_rating,imdb_votes
count,1629.0,1629.0,1629.0,1629.0
mean,0.0,2010.263966,5.557459,5384.263352
std,0.0,5.381542,1.567609,14552.103231
min,0.0,2001.0,0.0,0.0
25%,0.0,2005.0,4.4,233.0
50%,0.0,2011.0,5.6,1000.0
75%,0.0,2015.0,6.8,4287.0
max,0.0,2019.0,9.4,310481.0


**isnull() method:**

In [104]:
# count the null values in each column
movies.isnull().sum()

title_x                0
imdb_id                0
poster_path          103
wiki_link              0
title_y                0
original_title         0
is_adult               0
year_of_release        0
runtime                0
genres                 0
imdb_rating            0
imdb_votes             0
story                 20
summary                0
tagline             1072
actors                 5
wins_nominations     922
release_date         107
dtype: int64

**duplicated() method:**

In [105]:
# True marks a row that repeats an earlier row; False marks a unique row or a first occurrence
movies.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1624    False
1625    False
1626    False
1627    False
1628    False
Length: 1629, dtype: bool

In [106]:
dictionary = {
    'iq':[100,95,14,0,0],
    'marks':[107,87,16,0,0],
    'package':[89,78,12,0,0],
}
students = pd.DataFrame(dictionary)
print(students)
print(students.duplicated().sum())

    iq  marks  package
0  100    107       89
1   95     87       78
2   14     16       12
3    0      0        0
4    0      0        0
1
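
The count above is 1 rather than 2 because, by default, only the later copy of a repeated row is flagged. A minimal sketch on a made-up frame (`demo` is a hypothetical name) showing the default and the `keep=False` variant:

```python
import pandas as pd

# Hypothetical frame whose last two rows are identical.
demo = pd.DataFrame({'iq': [100, 0, 0], 'marks': [107, 0, 0]})

first_kept = demo.duplicated()            # keep='first' is the default
all_marked = demo.duplicated(keep=False)  # flag every row in a duplicate group

print(first_kept.tolist())   # [False, False, True]
print(all_marked.tolist())   # [False, True, True]
```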


**rename() method:**
- rename() is used to rename row (index) labels or column labels. All columns can also be relabeled at once with dataframe.columns = [list of new names].
- **Syntax: DataFrame.rename(mapper, index, columns, axis, copy, inplace, level)**
    - **mapper, index and columns:** Dictionary in which each key is an old name and its value is the new name. Use either mapper together with axis, or index and/or columns; the two styles cannot be combined.
    - **axis:** int or string value, 0/'index' for rows and 1/'columns' for columns. It applies only when mapper is used.
    - **copy:** Copies the underlying data if True.
    - **inplace:** Modifies the original DataFrame if True; otherwise a new DataFrame is returned.
    - **level:** Specifies which level to rename when the DataFrame has a multi-level index.



In [107]:
students.rename(columns={'marks':'%','package':'lpa'}, inplace=True)
print(students)

    iq    %  lpa
0  100  107   89
1   95   87   78
2   14   16   12
3    0    0    0
4    0    0    0
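
The other ways of renaming described above (mapper with axis, the index parameter, and direct assignment to columns) can be sketched as follows. The frame `demo` and the new label names are invented for illustration:

```python
import pandas as pd

# Hypothetical frame used only for this sketch.
demo = pd.DataFrame({'iq': [100, 95], 'marks': [107, 87]})

# mapper + axis: same effect as columns={'marks': 'percent'}
by_axis = demo.rename({'marks': 'percent'}, axis='columns')

# index= renames row labels; without inplace=True the original is untouched
by_index = demo.rename(index={0: 'first', 1: 'second'})

# Direct assignment replaces all column labels at once
relabeled = demo.copy()
relabeled.columns = ['IQ', 'MARKS']

print(by_axis.columns.tolist())
print(by_index.index.tolist())
print(relabeled.columns.tolist())
```

Returning a new frame (the default) is usually easier to reason about than `inplace=True`, because the original stays available for comparison.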


#### Pandas DataFrame Mathematical methods:

In [108]:
print(students.sum())        # default: column-wise (axis=0)
print(students.sum(axis=1))  # row-wise (horizontal)

iq     209
%      210
lpa    179
dtype: int64
0    296
1    260
2     42
3      0
4      0
dtype: int64
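
Other aggregations such as `mean()`, `min()`, and `max()` follow the same `axis` convention as `sum()`. A minimal sketch on a made-up frame (`demo` is a hypothetical name):

```python
import pandas as pd

# Hypothetical frame used only for this sketch.
demo = pd.DataFrame({'iq': [100, 95, 14], 'marks': [107, 87, 16]})

col_means = demo.mean()       # one value per column (axis=0 is the default)
row_max = demo.max(axis=1)    # largest value in each row
row_min = demo.min(axis=1)    # smallest value in each row

print(col_means)
print(row_max.tolist())   # [107, 95, 16]
print(row_min.tolist())   # [100, 87, 14]
```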
