# PANDAS

---

Pandas stores data in tables, much like an Excel sheet, but Excel has limitations:

- Size: a worksheet holds at most 1,048,576 rows
- Complex data transformations are hard to express
- Automation is limited
- Cross-platform support is limited

Pandas has two core data structures:

- Series (1D labeled array)
- DataFrame (2D labeled table)

The columns are called features (or attributes), and the rows are called observations.

### Excel Terminology -> Pandas terminology

- Worksheet -> Dataframe
- Column -> Series
- Row Heading or number -> Index
- Empty cell -> NaN
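
The mapping above can be seen on a tiny frame. This is a minimal sketch with made-up data; the names `sheet`, `item` and `price` are illustrative and do not come from the dataset used later.

```python
import numpy as np
import pandas as pd

# A DataFrame plays the role of a worksheet
sheet = pd.DataFrame(
    {"item": ["pen", "book", "bag"], "price": [10.0, np.nan, 250.0]},
    index=["r1", "r2", "r3"],  # row headings become the index
)

price_column = sheet["price"]           # one column is a Series
missing = sheet["price"].isna().sum()   # an empty cell is stored as NaN

print(type(sheet).__name__)         # DataFrame
print(type(price_column).__name__)  # Series
print(list(sheet.index))            # ['r1', 'r2', 'r3']
print(missing)                      # 1
```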


## 1. Creating a dataframe from an array

### 1.1 Option 1


In [365]:
import pandas as pd
import numpy as np

In [366]:
# creating an array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# creating a dataframe from the array using pandas
df=pd.DataFrame(data,columns=['col1','col2','col3'],index=['row1','row2','row3'])

# printing the dataframe
df

Unnamed: 0,col1,col2,col3
row1,1,2,3
row2,4,5,6
row3,7,8,9


### 1.2 Option 2 (creating a dataframe from a list of lists)


In [367]:
# creating a list of lists (one inner list per row)
my_list=[['Sutapa',19,'Me'],
           ['Sandipan',15,'Brother'],
           ['Prity',20,'Friend']]
# creating a dataframe from the list using pandas
df=pd.DataFrame(my_list,columns=['Name','Age','Relation'],index=['A.','B.','C.'])
df

Unnamed: 0,Name,Age,Relation
A.,Sutapa,19,Me
B.,Sandipan,15,Brother
C.,Prity,20,Friend


## 2. By using a Dictionary


In [368]:
# lists that will become the values of the dictionary
states=['West Bengal','Assam','Bihar','Sikkim']
Population=[10000000,2000000,30000000,25000]
# creating a dictionary out of the list
my_dict={'states':states,'Population':Population}

# creating a dataframe from the dictionary using pandas
df=pd.DataFrame(my_dict,columns=['states','Population'])
df

Unnamed: 0,states,Population
0,West Bengal,10000000
1,Assam,2000000
2,Bihar,30000000
3,Sikkim,25000
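
When a dictionary is passed, its keys become the column names, and the optional `columns` argument selects and orders them. A minimal sketch, using a shortened copy of the data above:

```python
import pandas as pd

my_dict = {"states": ["Assam", "Bihar"], "Population": [2000000, 30000000]}

default_order = pd.DataFrame(my_dict)  # columns follow the key order
reordered = pd.DataFrame(my_dict, columns=["Population", "states"])

print(list(default_order.columns))  # ['states', 'Population']
print(list(reordered.columns))      # ['Population', 'states']
```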


## 3. Creating a Dataframe from a CSV file


In [369]:
# reading a csv file
df=pd.read_csv('C:/Users/Lenovo/Downloads/StudentsPerformance.csv')
df.head() # displaying the first 5 rows of the dataframe
df.tail() # displaying the last 5 rows of the dataframe
df.tail(2) # displaying the last 2 rows of the dataframe

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
998,female,group D,some college,standard,completed,68,78,77
999,female,group D,some college,free/reduced,none,77,86,86
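
In a notebook, only the value of the last expression in a cell is displayed, which is why the output above shows just the two rows from `df.tail(2)`. The sketch below stores each result in its own variable instead; the 12-row frame `demo` is a hypothetical stand-in for the CSV file.

```python
import pandas as pd

# Hypothetical 12-row frame standing in for the CSV file
demo = pd.DataFrame({"score": range(12)})

first_five = demo.head()   # default n=5
last_five = demo.tail()    # default n=5
last_two = demo.tail(2)    # explicit n=2

print(len(first_five), len(last_five), len(last_two))  # 5 5 2
print(last_two.index.tolist())                         # [10, 11]
```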


# 1. DataFrames and Their Properties


In [370]:
import pandas as pd
# reading a csv file
df=pd.read_csv('C:/Users/Lenovo/Downloads/StudentsPerformance.csv')

### 1. Attributes


In [371]:
df.shape # shape of the dataframe as (number of rows, number of columns)

(1000, 8)

In [372]:
df.index # displaying the index of the dataframe

RangeIndex(start=0, stop=1000, step=1)

In [373]:
df.columns # displaying the columns of the dataframe

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

In [374]:
df.dtypes # displaying the data types of the columns of the dataframe

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

### 2. Methods


In [375]:
df.info() # displaying the information about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [376]:
df.describe() # displaying the statistical information about the dataframe

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


### 3. Functions


In [377]:
# Obtaining the length of the dataframe
len(df) # returns the number of rows in the dataframe

1000

In [378]:
# obtaining the number of columns in the dataframe
len(df.columns) # returns the number of columns in the dataframe

8

In [379]:
# obtaining the alphabetically smallest column name (not necessarily the first column)
min(df.columns) # returns the column name that sorts first alphabetically

'gender'

In [380]:
# obtaining the alphabetically largest column name (not necessarily the last column)
max(df.columns) # returns the column name that sorts last alphabetically

'writing score'
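
`min()` and `max()` compare the column labels as strings, so they only match the first and last columns by coincidence. A small sketch with a hypothetical frame whose columns are not in alphabetical order shows the difference from positional access:

```python
import pandas as pd

# Hypothetical frame whose column order is not alphabetical
demo = pd.DataFrame({"zeta": [1], "alpha": [2], "mid": [3]})

first_by_position = demo.columns[0]
last_by_position = demo.columns[-1]
smallest_label = min(demo.columns)
largest_label = max(demo.columns)

print(first_by_position, last_by_position)  # zeta mid
print(smallest_label, largest_label)        # alpha zeta
```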

In [381]:
#obtaining the highest index of the dataframe
max(df.index) # returns the highest index of the dataframe

999

In [382]:
# Rounding the values in the dataframe
df.round(3) # rounds numeric columns to 3 decimal places (the integer scores here are unchanged)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


## 1. Working With columns in DataFrame


In [383]:
import pandas as pd
# reading a csv file
df=pd.read_csv('C:/Users/Lenovo/Downloads/StudentsPerformance.csv')

### 1. Selecting One column

#### 1.1 Syntax 1 (Recommended [])


In [384]:
# Selecting a column with []
df['gender'] # returns the column as a Series (a 1D labeled array)

0      female
1      female
2      female
3        male
4        male
        ...  
995    female
996      male
997    female
998    female
999    female
Name: gender, Length: 1000, dtype: object

In [385]:
# checking its data type
type(df['gender']) 

pandas.core.series.Series

In [386]:
# Series: Attributes and Methods
df['gender'].index
df['gender'].head() 

0    female
1    female
2    female
3      male
4      male
Name: gender, dtype: object

#### 1.2 Syntax 2


In [387]:
# Selecting a column with dot notation
df.gender

0      female
1      female
2      female
3        male
4        male
        ...  
995    female
996      male
997    female
998    female
999    female
Name: gender, Length: 1000, dtype: object

In [388]:
# Pitfalls of using dot notation
#df.math score # gives an error as it does not allow spaces in the column name
df['math score'] # this works as it is a valid column name with spaces

0      72
1      69
2      90
3      47
4      76
       ..
995    88
996    62
997    59
998    68
999    77
Name: math score, Length: 1000, dtype: int64
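
Spaces are not the only pitfall of dot notation: a column whose name matches an existing DataFrame attribute or method is also unreachable with a dot. A minimal sketch, assuming a hypothetical frame with a column called `count`:

```python
import pandas as pd

# Hypothetical frame: one name has a space, one collides with a method
demo = pd.DataFrame({"math score": [72, 69], "count": [1, 2]})

math = demo["math score"]      # brackets handle the space
count_column = demo["count"]   # brackets return the column
count_attr = demo.count        # the dot returns the DataFrame.count method

print(callable(count_attr))                 # True
print(isinstance(count_column, pd.Series))  # True
```

This is one reason the bracket syntax is the safer default.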

### 2. Selecting two or more columns


In [389]:
# selecting 2 columns using [[]] (a list of column names inside the brackets)
df[['gender','math score']] # gives a dataframe with 2 columns

Unnamed: 0,gender,math score
0,female,72
1,female,69
2,female,90
3,male,47
4,male,76
...,...,...
995,female,88
996,male,62
997,female,59
998,female,68


In [390]:
# checking out the data type of the two columns
df[['gender','math score']].dtypes
# datatype of the selection
type(df[['gender','math score']]) # gives a dataframe

pandas.core.frame.DataFrame
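
The number of brackets decides the return type even for a single column. A short sketch on a hypothetical two-row frame:

```python
import pandas as pd

demo = pd.DataFrame({"gender": ["female", "male"], "math score": [72, 47]})

as_series = demo["gender"]    # single brackets -> Series
as_frame = demo[["gender"]]   # list with one name -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```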

### 3. Adding a new Column

#### 1.1 Adding a Column with a scalar value


In [391]:
# adding a new column to dataframe
df['Language Score']= 70
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,72,72,74,70
1,female,group C,some college,standard,completed,69,90,88,70
2,female,group B,master's degree,standard,none,90,95,93,70
3,male,group A,associate's degree,free/reduced,none,47,57,44,70
4,male,group C,some college,standard,none,76,78,75,70


#### 1.2 Adding a new Column with an array


In [392]:
# The data frame has 1000 elements so we need to create an array of 1000 elements
# import numpy 
import numpy as np
# creating an array of 1000 elements
language_score=np.arange(0, 1000) # 0 to 999
df['Language Score']=language_score
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,72,72,74,0
1,female,group C,some college,standard,completed,69,90,88,1
2,female,group B,master's degree,standard,none,90,95,93,2
3,male,group A,associate's degree,free/reduced,none,47,57,44,3
4,male,group C,some college,standard,none,76,78,75,4
...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,995
996,male,group C,high school,free/reduced,none,62,55,55,996
997,female,group C,high school,free/reduced,completed,59,71,65,997
998,female,group D,some college,standard,completed,68,78,77,998


In [393]:
# creating random integers from 50 to 99 (the upper bound 100 is excluded)
language_score=np.random.randint(50,100,size=1000)
# for float values use np.random.uniform (this result is not stored here)
np.random.uniform(50,100,size=1000)
df['Language Score']=language_score
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,72,72,74,59
1,female,group C,some college,standard,completed,69,90,88,99
2,female,group B,master's degree,standard,none,90,95,93,61
3,male,group A,associate's degree,free/reduced,none,47,57,44,83
4,male,group C,some college,standard,none,76,78,75,56
...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,69
996,male,group C,high school,free/reduced,none,62,55,55,73
997,female,group C,high school,free/reduced,completed,59,71,65,71
998,female,group D,some college,standard,completed,68,78,77,56
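
The random scores change on every run. If repeatable values are wanted, NumPy's seeded `Generator` is one option; the sketch below is an alternative to the `np.random.randint` call above, not what the notebook ran, and the seed `42` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen for repeatability

# integers in [50, 100): 100 itself is never drawn
exclusive = rng.integers(50, 100, size=1000)
# integers in [50, 100]: endpoint=True makes the upper bound inclusive
inclusive = rng.integers(50, 100, size=1000, endpoint=True)
# floats in [50.0, 100.0)
floats = rng.uniform(50, 100, size=1000)

print(exclusive.min() >= 50, exclusive.max() <= 99)
print(floats.dtype)
```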


### 4. Adding multiple columns with assign() and insert()

#### 1.1 assign()


In [394]:
import numpy as np
score1=np.random.randint(50,100,size=1000)
score2=np.random.randint(35,80,size=1000)
# creating a series
series1=pd.Series(score1,name='Science Score')
series2=pd.Series(score2,name='History Score')  

# Using assign() to add multiple columns
# df=df.assign(Science=series1,History=series2)

# pitfalls: assign() always appends the new columns at the end, so a position cannot be chosen
# also, with keyword arguments the column name cannot contain spaces
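
The restriction on spaces applies to the keyword-argument form, because keywords must be valid Python identifiers. One workaround is to unpack a dictionary; the sketch below uses a hypothetical three-row frame rather than the student data.

```python
import pandas as pd

demo = pd.DataFrame({"math score": [72, 69, 90]})
science = pd.Series([68, 93, 80])
history = pd.Series([43, 55, 67])

# Names with spaces are passed by unpacking a dictionary into assign()
demo = demo.assign(**{"Science Score": science, "History Score": history})

print(list(demo.columns))
# ['math score', 'Science Score', 'History Score']
```

The new columns still land at the end of the frame, so `insert()` remains the tool when the position matters.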


#### 1.2 insert() (recommended)


In [395]:
# insert() adds one column at a given position, so it is called once per column
df.insert(5,'Science Score',series1)
df.insert(8,'History Score',series2)
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,68,72,72,43,74,59
1,female,group C,some college,standard,completed,93,69,90,55,88,99
2,female,group B,master's degree,standard,none,80,90,95,67,93,61
3,male,group A,associate's degree,free/reduced,none,61,47,57,38,44,83
4,male,group C,some college,standard,none,83,76,78,66,75,56
...,...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,90,88,99,73,95,69
996,male,group C,high school,free/reduced,none,96,62,55,48,55,73
997,female,group C,high school,free/reduced,completed,68,59,71,56,65,71
998,female,group D,some college,standard,completed,70,68,78,78,77,56
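
Two details of `insert()` are easy to miss: it changes the frame in place and returns `None`, and it raises a `ValueError` if the column already exists, which is what happens when the cell above is run twice. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

result = demo.insert(1, "b", [3, 4])  # position 1 = second column
print(result)              # None: the frame is changed in place
print(list(demo.columns))  # ['a', 'b', 'c']

# Running the same insert again fails because 'b' already exists
try:
    demo.insert(1, "b", [3, 4])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```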


# 2. Math Operation


In [396]:
import pandas as pd
import numpy as np
# the dataframe from the previous section is reused here (it already has the added score columns)

## 1.1 Operations in Columns


In [397]:
# selecting a column and calculating its sum
df=df.round()
df['Science Score'].sum() # gives the sum of the Science Score column

np.int64(74487)

In [398]:
df['math score'].count()
df['math score'].mean() # gives the mean of the math score column
df['math score'].median() # gives the median of the math score column

np.float64(66.0)

In [399]:
df['math score'].max() # gives the maximum value of the math score column

np.int64(100)

In [400]:
df['Science Score'].min() # gives the minimum value of the Science Score column

np.int32(50)

## 1.2 Operations in rows


In [401]:
# creating a series for total score for each student.
total_score=(df['Science Score']+df['History Score']+df['math score']+df['Language Score']+df['reading score']+df['writing score'])
# inserting the total score column at index 11
df.insert(11,'Total Score',total_score)
# Making another series for percentage
percentage=(df['Total Score']/600)*100
# inserting the percentage column at index 12
df.insert(12,'Percentage',percentage)
# Rounding the percentage to 2 decimal places
df['Percentage'] = df['Percentage'].round(2)
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score,Total Score,Percentage
0,female,group B,bachelor's degree,standard,none,68,72,72,43,74,59,388,64.67
1,female,group C,some college,standard,completed,93,69,90,55,88,99,494,82.33
2,female,group B,master's degree,standard,none,80,90,95,67,93,61,486,81.00
3,male,group A,associate's degree,free/reduced,none,61,47,57,38,44,83,330,55.00
4,male,group C,some college,standard,none,83,76,78,66,75,56,434,72.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,90,88,99,73,95,69,514,85.67
996,male,group C,high school,free/reduced,none,96,62,55,48,55,73,389,64.83
997,female,group C,high school,free/reduced,completed,68,59,71,56,65,71,390,65.00
998,female,group D,some college,standard,completed,70,68,78,78,77,56,427,71.17
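
Adding the columns one by one works, but the same row totals can be written with `sum(axis=1)` over a list of columns. This is a sketch on a hypothetical two-row frame with three subjects, assuming each subject is marked out of 100:

```python
import pandas as pd

demo = pd.DataFrame(
    {"math score": [72, 69], "reading score": [72, 90], "writing score": [74, 88]}
)
score_columns = ["math score", "reading score", "writing score"]

# axis=1 sums across the columns of each row
demo["Total Score"] = demo[score_columns].sum(axis=1)
# divide by the maximum possible total (100 per subject)
max_total = 100 * len(score_columns)
demo["Percentage"] = (demo["Total Score"] / max_total * 100).round(2)

print(demo["Total Score"].tolist())  # [218, 247]
print(demo["Percentage"].tolist())   # [72.67, 82.33]
```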


## 2. Value Counts


In [402]:
# Counting gender elements
# len function
len(df['gender'])
# .count() method
df['gender'].count()

np.int64(1000)

In [403]:
# counting Gender elements by category
df['gender'].value_counts()

gender
female    518
male      482
Name: count, dtype: int64

In [404]:
# return the relative frequency of value_count (divide all values by the sum of values)
df['gender'].value_counts(normalize=True)

gender
female    0.518
male      0.482
Name: proportion, dtype: float64

In [405]:
# Counting 'parental level of education' elements by category
df['parental level of education'].value_counts()

parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

## 3. Sort a DataFrame


In [406]:
# Sort by one column
df.sort_values(by='Total Score') # sorts the dataframe by Total Score in ascending order

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score,Total Score,Percentage
363,female,group D,some high school,free/reduced,none,58,27,34,39,32,56,246,41.00
59,female,group C,some high school,free/reduced,none,74,0,17,70,10,80,251,41.83
145,female,group C,some college,free/reduced,none,60,22,39,40,33,63,257,42.83
338,female,group B,some high school,free/reduced,none,84,24,38,39,27,57,269,44.83
980,female,group B,high school,free/reduced,none,82,8,24,67,23,72,276,46.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
546,female,group A,some high school,standard,completed,87,92,100,63,97,93,532,88.67
104,male,group C,some college,standard,completed,97,98,86,75,90,88,534,89.00
458,female,group E,bachelor's degree,standard,none,94,100,100,44,100,97,535,89.17
580,female,group D,some high school,standard,none,88,81,97,76,96,99,537,89.50


In [407]:
# Sort by a column in descending order
df.sort_values(by='Percentage', ascending=False) # sorts the dataframe by Percentage in descending order

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score,Total Score,Percentage
580,female,group D,some high school,standard,none,88,81,97,76,96,99,537,89.50
149,male,group E,associate's degree,free/reduced,completed,97,100,100,72,93,75,537,89.50
458,female,group E,bachelor's degree,standard,none,94,100,100,44,100,97,535,89.17
104,male,group C,some college,standard,completed,97,98,86,75,90,88,534,89.00
546,female,group A,some high school,standard,completed,87,92,100,63,97,93,532,88.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...
980,female,group B,high school,free/reduced,none,82,8,24,67,23,72,276,46.00
338,female,group B,some high school,free/reduced,none,84,24,38,39,27,57,269,44.83
145,female,group C,some college,free/reduced,none,60,22,39,40,33,63,257,42.83
59,female,group C,some high school,free/reduced,none,74,0,17,70,10,80,251,41.83


In [408]:
# Sorting Data by multiple columns
df.sort_values(['math score', 'Language Score'], ascending=[False, False]) # sorts first by math score, then by Language Score, both in descending order

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score,Total Score,Percentage
458,female,group E,bachelor's degree,standard,none,94,100,100,44,100,97,535,89.17
962,female,group E,associate's degree,standard,none,54,100,100,66,100,88,508,84.67
916,male,group E,bachelor's degree,standard,completed,72,100,100,49,100,84,505,84.17
623,male,group A,some college,standard,completed,76,100,96,48,86,82,488,81.33
625,male,group D,some college,standard,completed,55,100,97,69,99,76,496,82.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,female,group C,some college,free/reduced,none,60,22,39,40,33,63,257,42.83
787,female,group B,some college,standard,none,76,19,38,76,32,91,332,55.33
17,female,group B,some high school,free/reduced,none,92,18,32,73,28,50,293,48.83
980,female,group B,high school,free/reduced,none,82,8,24,67,23,72,276,46.00


In [409]:
# Sort descending by multiple columns and update dataframe
df.sort_values(['math score', 'Language Score'], ascending=[False, False], inplace=True) # sorts first by math score, then by Language Score, both descending, and updates the dataframe
# inplace=True modifies df itself instead of returning a new dataframe
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score,Total Score,Percentage
458,female,group E,bachelor's degree,standard,none,94,100,100,44,100,97,535,89.17
962,female,group E,associate's degree,standard,none,54,100,100,66,100,88,508,84.67
916,male,group E,bachelor's degree,standard,completed,72,100,100,49,100,84,505,84.17
623,male,group A,some college,standard,completed,76,100,96,48,86,82,488,81.33
625,male,group D,some college,standard,completed,55,100,97,69,99,76,496,82.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,female,group C,some college,free/reduced,none,60,22,39,40,33,63,257,42.83
787,female,group B,some college,standard,none,76,19,38,76,32,91,332,55.33
17,female,group B,some high school,free/reduced,none,92,18,32,73,28,50,293,48.83
980,female,group B,high school,free/reduced,none,82,8,24,67,23,72,276,46.00


In [410]:
# sort ascending with a key function
df.sort_values(by='race/ethnicity', key=lambda col: col.str.lower(), ascending=True) # returns a new dataframe sorted by race/ethnicity, ignoring case; the result is not assigned, so df below is unchanged
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score,Total Score,Percentage
458,female,group E,bachelor's degree,standard,none,94,100,100,44,100,97,535,89.17
962,female,group E,associate's degree,standard,none,54,100,100,66,100,88,508,84.67
916,male,group E,bachelor's degree,standard,completed,72,100,100,49,100,84,505,84.17
623,male,group A,some college,standard,completed,76,100,96,48,86,82,488,81.33
625,male,group D,some college,standard,completed,55,100,97,69,99,76,496,82.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,female,group C,some college,free/reduced,none,60,22,39,40,33,63,257,42.83
787,female,group B,some college,standard,none,76,19,38,76,32,91,332,55.33
17,female,group B,some high school,free/reduced,none,92,18,32,73,28,50,293,48.83
980,female,group B,high school,free/reduced,none,82,8,24,67,23,72,276,46.00
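
The effect of the `key` function is easier to see when the labels differ in case, and the sorted result has to be assigned to be kept. A small sketch with hypothetical mixed-case labels:

```python
import pandas as pd

demo = pd.DataFrame({"group": ["group b", "Group A", "group C"], "score": [1, 2, 3]})

# Without a key, uppercase letters sort before lowercase ones
plain = demo.sort_values(by="group")
# With a key, the comparison uses the lowercased labels
by_lower = demo.sort_values(by="group", key=lambda col: col.str.lower())

print(plain["group"].tolist())     # ['Group A', 'group C', 'group b']
print(by_lower["group"].tolist())  # ['Group A', 'group b', 'group C']
print(demo["group"].tolist())      # ['group b', 'Group A', 'group C'] (unchanged)
```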


## Creating Index


In [411]:
import numpy as np
import pandas as pd
import random
df=pd.read_csv('C:/Users/Lenovo/Downloads/StudentsPerformance.csv')

In [412]:
# creating non-repeating values for the index
new_index=np.arange(0,1000)
# shuffling the index
random.shuffle(new_index) # this gives a shuffled index

# assigning the new index to the dataframe
df.index=new_index
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
50,female,group B,bachelor's degree,standard,none,72,72,74
346,female,group C,some college,standard,completed,69,90,88
154,female,group B,master's degree,standard,none,90,95,93
307,male,group A,associate's degree,free/reduced,none,47,57,44
365,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
251,female,group E,master's degree,standard,completed,88,99,95
585,male,group C,high school,free/reduced,none,62,55,55
691,female,group C,high school,free/reduced,completed,59,71,65
5,female,group D,some college,standard,completed,68,78,77
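
The same shuffled, non-repeating index can be produced in one step with NumPy alone. This is an alternative sketch, not the approach used above; the five-row frame and the seed `0` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"score": [72, 69, 90, 47, 76]})

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable shuffles
# permutation(n) returns the integers 0..n-1 in random order, without repeats
demo.index = rng.permutation(len(demo))

print(sorted(demo.index.tolist()))  # [0, 1, 2, 3, 4]
print(demo.index.is_unique)         # True
```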


In [413]:
# creating a new column with a new index
df.insert(0,'new index',new_index)
df

Unnamed: 0,new index,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
50,50,female,group B,bachelor's degree,standard,none,72,72,74
346,346,female,group C,some college,standard,completed,69,90,88
154,154,female,group B,master's degree,standard,none,90,95,93
307,307,male,group A,associate's degree,free/reduced,none,47,57,44
365,365,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...,...
251,251,female,group E,master's degree,standard,completed,88,99,95
585,585,male,group C,high school,free/reduced,none,62,55,55
691,691,female,group C,high school,free/reduced,completed,59,71,65
5,5,female,group D,some college,standard,completed,68,78,77


### 2. Setting the new index column as the index


In [414]:
# set new index as the index
df.set_index('new index', inplace=True) # sets the new index column as the index of the dataframe
df

new index,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
50,female,group B,bachelor's degree,standard,none,72,72,74
346,female,group C,some college,standard,completed,69,90,88
154,female,group B,master's degree,standard,none,90,95,93
307,male,group A,associate's degree,free/reduced,none,47,57,44
365,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
251,female,group E,master's degree,standard,completed,88,99,95
585,male,group C,high school,free/reduced,none,62,55,55
691,female,group C,high school,free/reduced,completed,59,71,65
5,female,group D,some college,standard,completed,68,78,77


In [417]:
# Sorting the dataframe according to the new index
# we have already set the new index as the index of the dataframe
df.sort_index(ascending=False) # 1: descending, returns a new dataframe
df.sort_index() # 2: ascending, returns a new dataframe
df.sort_index(inplace=True) # 3: ascending and in place
df

new index,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,male,group E,high school,free/reduced,none,55,56,51
1,female,group C,some college,free/reduced,none,62,67,62
2,female,group C,some college,standard,none,59,71,70
3,male,group C,master's degree,free/reduced,completed,72,66,72
4,male,group D,associate's degree,standard,none,40,52,43
...,...,...,...,...,...,...,...,...
995,female,group C,associate's degree,standard,completed,67,84,81
996,female,group B,some high school,standard,completed,32,51,44
997,male,group B,associate's degree,standard,none,65,54,57
998,female,group D,high school,standard,completed,88,99,100
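
The three index steps (set, sort, and optionally undo) can be chained without `inplace`, since each call returns a new frame. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

demo = pd.DataFrame({"new index": [2, 0, 1], "score": [90, 72, 69]})

indexed = demo.set_index("new index")  # returns a new frame
ordered = indexed.sort_index()         # ascending by default
restored = ordered.reset_index()       # the index becomes a column again

print(ordered.index.tolist())     # [0, 1, 2]
print(ordered["score"].tolist())  # [72, 69, 90]
print(list(restored.columns))     # ['new index', 'score']
```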


## Renaming Column


In [423]:
# rename a column and reassign the result to df
df=df.rename(columns={'gender':'Gender'})
df

new index,Gender,race/ethnicity,parental level of education,lunch,test preparation course,MS,reading score,writing score
0,male,group E,high school,free/reduced,none,55,56,51
1,female,group C,some college,free/reduced,none,62,67,62
2,female,group C,some college,standard,none,59,71,70
3,male,group C,master's degree,free/reduced,completed,72,66,72
4,male,group D,associate's degree,standard,none,40,52,43
...,...,...,...,...,...,...,...,...
995,female,group C,associate's degree,standard,completed,67,84,81
996,female,group B,some high school,standard,completed,32,51,44
997,male,group B,associate's degree,standard,none,65,54,57
998,female,group D,high school,standard,completed,88,99,100


In [None]:
# renaming multiple columns and update the dataframe with inplace argument
df.rename(columns={'math score':'MS','reading score':'RS'}, inplace=True)

In [427]:
# showing the dataframe
df

new index,Gender,race/ethnicity,parental level of education,lunch,test preparation course,MS,RS,writing score
0,male,group E,high school,free/reduced,none,55,56,51
1,female,group C,some college,free/reduced,none,62,67,62
2,female,group C,some college,standard,none,59,71,70
3,male,group C,master's degree,free/reduced,completed,72,66,72
4,male,group D,associate's degree,standard,none,40,52,43
...,...,...,...,...,...,...,...,...
995,female,group C,associate's degree,standard,completed,67,84,81
996,female,group B,some high school,standard,completed,32,51,44
997,male,group B,associate's degree,standard,none,65,54,57
998,female,group D,high school,standard,completed,88,99,100


### 2. Renaming Index


In [430]:
# renaming index labels 0, 1 and 2 and updating the dataframe
df.rename(index={0:'A',1:'B',2:'C'}, inplace=True) 
df.head(3)

new index,Gender,race/ethnicity,parental level of education,lunch,test preparation course,MS,RS,writing score
A,male,group E,high school,free/reduced,none,55,56,51
B,female,group C,some college,free/reduced,none,62,67,62
C,female,group C,some college,standard,none,59,71,70
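
`rename()` can change columns and index labels in one call, and labels that are missing from the mapping or from the frame are simply left alone. A short sketch on a hypothetical frame; the label `no such column` is deliberately absent:

```python
import pandas as pd

demo = pd.DataFrame({"math score": [55, 62, 59]}, index=[0, 1, 2])

renamed = demo.rename(
    columns={"math score": "MS", "no such column": "X"},  # unknown label is ignored
    index={0: "A", 1: "B"},                               # label 2 is left as it is
)

print(list(renamed.columns))   # ['MS']
print(renamed.index.tolist())  # ['A', 'B', 2]
print(list(demo.columns))      # ['math score'] (original untouched)
```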


# WEB SCRAPING - Pandas


In [None]:
# importing numpy and pandas
import numpy as np
import pandas as pd

Target website: https://www.football-data.co.uk/data.php


## 1. Reading from a url


In [431]:
pd.read_csv('https://www.football-data.co.uk/mmz4281/2425/E0.csv')

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA,BFECAHH,BFECAHA
0,E0,16/08/2024,20:00,Man United,Fulham,1,0,H,0,0,...,1.86,2.07,1.83,2.11,1.88,2.11,1.82,2.05,1.90,2.08
1,E0,17/08/2024,12:30,Ipswich,Liverpool,0,2,A,0,0,...,2.05,1.88,2.04,1.90,2.20,2.00,1.99,1.88,2.04,1.93
2,E0,17/08/2024,15:00,Arsenal,Wolves,2,0,H,1,0,...,2.02,1.91,2.00,1.90,2.05,1.93,1.99,1.87,2.02,1.96
3,E0,17/08/2024,15:00,Everton,Brighton,0,3,A,0,1,...,1.87,2.06,1.86,2.07,1.92,2.10,1.83,2.04,1.88,2.11
4,E0,17/08/2024,15:00,Newcastle,Southampton,1,0,H,1,0,...,1.87,2.06,1.88,2.06,1.89,2.10,1.82,2.05,1.89,2.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,E0,25/05/2025,16:00,Newcastle,Everton,0,1,A,0,0,...,2.00,1.85,2.01,1.90,2.01,1.95,1.95,1.91,1.93,2.05
376,E0,25/05/2025,16:00,Nott'm Forest,Chelsea,0,1,A,0,0,...,1.80,2.05,1.86,2.08,1.86,2.08,1.81,2.05,1.86,2.14
377,E0,25/05/2025,16:00,Southampton,Arsenal,1,2,A,0,1,...,2.03,1.83,2.04,1.87,2.07,1.87,2.03,1.83,2.06,1.89
378,E0,25/05/2025,16:00,Tottenham,Brighton,1,4,A,1,0,...,1.95,1.90,2.00,1.93,2.01,1.93,1.95,1.89,2.06,1.93


## 2. Reading from multiple URLs


The urls are:

- https://www.football-data.co.uk/mmz4281/2425/E0.csv
- https://www.football-data.co.uk/mmz4281/2425/E1.csv
- https://www.football-data.co.uk/mmz4281/2425/E2.csv
- https://www.football-data.co.uk/mmz4281/2425/E3.csv
- https://www.football-data.co.uk/mmz4281/2425/EC.csv


In [None]:
# root is common to all the urls for a particular season
root="https://www.football-data.co.uk/mmz4281/2425/"
# list of league codes (E0, E1, ...) that complete each url
leagues=['E0','E1','E2','E3','EC']
frames=[] # for storing the dataframes
for league in leagues:
    # making the url
    url = root + league + '.csv'
    frames.append(pd.read_csv(url))
frames[1] # displaying the second dataframe (E1)

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA,BFECAHH,BFECAHA
0,E1,09/08/2024,20:00,Blackburn,Derby,4,2,H,1,0,...,1.85,2.05,1.85,2.06,1.97,2.10,1.85,1.99,1.88,2.11
1,E1,09/08/2024,20:00,Preston,Sheffield United,0,2,A,0,1,...,1.92,1.98,1.93,1.98,1.97,1.99,1.90,1.94,1.94,2.04
2,E1,10/08/2024,12:30,Cardiff,Sunderland,0,2,A,0,1,...,2.11,1.79,2.13,1.80,2.13,1.81,2.09,1.79,2.18,1.83
3,E1,10/08/2024,12:30,Hull,Bristol City,1,1,D,0,0,...,1.99,1.91,2.00,1.91,2.04,1.91,1.99,1.87,2.03,1.96
4,E1,10/08/2024,12:30,Leeds,Portsmouth,3,3,D,1,2,...,1.92,1.98,1.93,1.95,2.00,1.98,1.94,1.89,2.00,1.97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
547,E1,03/05/2025,12:30,Sheffield United,Blackburn,1,1,D,0,0,...,1.89,2.01,1.90,2.01,1.91,2.11,1.84,2.04,1.89,2.10
548,E1,03/05/2025,12:30,Sunderland,QPR,0,1,A,0,1,...,2.01,1.89,2.04,1.88,2.19,1.89,2.04,1.81,2.16,1.82
549,E1,03/05/2025,12:30,Swansea,Oxford,3,3,D,1,1,...,2.02,1.88,2.03,1.88,2.08,1.89,2.00,1.85,2.09,1.89
550,E1,03/05/2025,12:30,Watford,Sheffield Weds,1,1,D,1,1,...,1.85,2.00,1.86,2.05,2.00,2.05,1.92,1.94,2.00,1.99
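
Once the list `frames` is filled, a common next step is to stack the tables into one dataframe with `pd.concat`. The sketch below runs offline on two small hypothetical frames that reuse the column names `Div`, `HomeTeam` and `FTHG` from the output above; the rows are made up.

```python
import pandas as pd

# Small hypothetical stand-ins for the downloaded league tables
frames = [
    pd.DataFrame({"Div": ["E0", "E0"], "HomeTeam": ["Arsenal", "Everton"], "FTHG": [2, 0]}),
    pd.DataFrame({"Div": ["E1"], "HomeTeam": ["Leeds"], "FTHG": [3]}),
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
all_matches = pd.concat(frames, ignore_index=True)

print(all_matches.shape)                            # (3, 3)
print(all_matches.index.tolist())                   # [0, 1, 2]
print(all_matches["Div"].value_counts().to_dict())  # {'E0': 2, 'E1': 1}
```

The `Div` column keeps track of which league each row came from after the tables are combined.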
