# PANDAS

---

Pandas works much like storing data in an Excel sheet, but Excel has some limitations:

- Size limit (about one million rows per worksheet)
- Complex data transformations
- Automation
- Cross-platform capabilities

Pandas stores data in array-like structures:

- Series (1D array)
- DataFrame (2D array)

Features are attributes (the columns), and the data rows are called observations.

### Excel Terminology -> Pandas terminology

- Worksheet -> DataFrame
- Column -> Series
- Row Heading or number -> Index
- Empty cell -> NaN
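
The mapping above can be sketched with a tiny, made-up table. The column names and values here are assumptions for illustration only, not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# A small "worksheet"; names and values are invented for this sketch.
sheet = pd.DataFrame(
    {"name": ["A", "B", "C"], "score": [72.0, np.nan, 90.0]},
    index=["row1", "row2", "row3"],
)

column = sheet["score"]            # one column of the worksheet -> Series

print(type(sheet).__name__)        # worksheet -> DataFrame
print(type(column).__name__)       # column -> Series
print(list(sheet.index))           # row headings -> Index
print(int(column.isna().sum()))    # number of empty cells (NaN) in the column
```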


## 1. Creating a dataframe from an array

### 1.1 Option 1


In [68]:
import pandas as pd
import numpy as np

In [69]:
# creating an array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# creating a dataframe from the array using pandas
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'], index=['row1', 'row2', 'row3'])

# printing the dataframe
df

Unnamed: 0,col1,col2,col3
row1,1,2,3
row2,4,5,6
row3,7,8,9


### 1.2 Option 2 (Creating a DataFrame from a list of lists)


In [70]:
# creating a list of lists (one inner list per row)
my_list=[['Sutapa',19,'Me'],
           ['Sandipan',15,'Brother'],
           ['Prity',20,'Friend']]
# creating a dataframe from the list using pandas
df=pd.DataFrame(my_list,columns=['Name','Age','Relation'],index=['A.','B.','C.'])
df

Unnamed: 0,Name,Age,Relation
A.,Sutapa,19,Me
B.,Sandipan,15,Brother
C.,Prity,20,Friend


## 2. Creating a DataFrame from a Dictionary


In [71]:
# lists that will become the dictionary values
states=['West Bengal','Assam','Bihar','Sikkim']
Population=[10000000,2000000,30000000,25000]
# creating a dictionary out of the lists (keys become column names)
my_dict={'states':states,'Population':Population}

# creating a dataframe from the dictionary using pandas
df=pd.DataFrame(my_dict,columns=['states','Population'])
df

Unnamed: 0,states,Population
0,West Bengal,10000000
1,Assam,2000000
2,Bihar,30000000
3,Sikkim,25000
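
When the data is a dictionary, the `columns` argument selects and orders the keys rather than renaming them. The sketch below illustrates this with a shortened version of the dictionary above; the `Capital` column is a hypothetical name that is deliberately absent from the dictionary.

```python
import pandas as pd

my_dict = {"states": ["Assam", "Bihar"], "Population": [2000000, 30000000]}

# Same keys, different order: the columns are reordered.
reordered = pd.DataFrame(my_dict, columns=["Population", "states"])

# "Capital" is not a key in my_dict, so that column is filled with NaN.
with_extra = pd.DataFrame(my_dict, columns=["states", "Population", "Capital"])

print(reordered)
print(with_extra)
```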


## 3. Creating a Dataframe from a CSV file


In [72]:
# reading a csv file
df=pd.read_csv('C:/Users/Lenovo/Downloads/StudentsPerformance.csv')
df.head() # the first 5 rows of the dataframe
df.tail() # the last 5 rows of the dataframe
df.tail(2) # the last 2 rows of the dataframe (only this last expression is displayed)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
998,female,group D,some college,standard,completed,68,78,77
999,female,group D,some college,free/reduced,none,77,86,86
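
Because the CSV path above is specific to one machine, here is a minimal self-contained sketch of `head()` and `tail()` that reads from an in-memory string instead. The `csv_text` content is invented for illustration.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = "name,score\na,1\nb,2\nc,3\nd,4\ne,5\nf,6\ng,7\n"
demo = pd.read_csv(io.StringIO(csv_text))

first_five = demo.head()   # default n=5
last_five = demo.tail()    # default n=5
last_two = demo.tail(2)    # explicit n=2

# In a notebook only the last expression of a cell is shown,
# so print() is used here to display all three results.
print(first_five)
print(last_five)
print(last_two)
```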


# 1. DataFrames and Their Properties


In [73]:
import pandas as pd
# reading a csv file
df=pd.read_csv('C:/Users/Lenovo/Downloads/StudentsPerformance.csv')

### 1. Attributes


In [74]:
df.shape # the shape of the dataframe as (number of rows, number of columns)

(1000, 8)

In [75]:
df.index # displaying the index of the dataframe

RangeIndex(start=0, stop=1000, step=1)

In [76]:
df.columns # displaying the columns of the dataframe

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

In [77]:
df.dtypes # displaying the data types of the columns of the dataframe

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

### 2. Methods


In [78]:
df.info() # displaying the information about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [79]:
df.describe() # displaying the statistical information about the dataframe

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


### 3. Functions


In [80]:
# Obtaining the length of the dataframe
len(df) # returns the number of rows in the dataframe

1000

In [81]:
# obtaining the number of columns in the dataframe
len(df.columns) # returns the number of columns in the dataframe

8

In [82]:
# obtaining the alphabetically smallest column name
min(df.columns) # compares names as strings; here it happens to also be the first column

'gender'

In [83]:
# obtaining the alphabetically largest column name
max(df.columns) # compares names as strings; here it happens to also be the last column

'writing score'
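
`min()` and `max()` compare column names as strings, so they match the first and last columns only by coincidence in this dataset. A small sketch with made-up column names shows the difference from positional access:

```python
import pandas as pd

# Hypothetical column names, chosen so alphabetical and positional order differ.
demo = pd.DataFrame(columns=["zebra", "apple", "mango"])

alphabetical_first = min(demo.columns)   # smallest string
alphabetical_last = max(demo.columns)    # largest string
positional_first = demo.columns[0]       # first column by position
positional_last = demo.columns[-1]       # last column by position

print(alphabetical_first, positional_first)
print(alphabetical_last, positional_last)
```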

In [84]:
#obtaining the highest index of the dataframe
max(df.index) # returns the highest index of the dataframe

999

In [85]:
# Rounding the values in the dataframe
df.round(3) # rounding the values in the dataframe to 3 decimal places

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


## 1. Working With columns in DataFrame


In [86]:
import pandas as pd
# reading a csv file
df=pd.read_csv('C:/Users/Lenovo/Downloads/StudentsPerformance.csv')

### 1. Selecting One column

#### 1.1 Syntax 1 (Recommended [])


In [87]:
# Selecting a column with []
df['gender'] # returns the column as a Series (a labelled 1D array)

0      female
1      female
2      female
3        male
4        male
        ...  
995    female
996      male
997    female
998    female
999    female
Name: gender, Length: 1000, dtype: object

In [88]:
# checking its data type
type(df['gender']) 

pandas.core.series.Series

In [89]:
# Series: Attributes and Methods
df['gender'].index
df['gender'].head() 

0    female
1    female
2    female
3      male
4      male
Name: gender, dtype: object

#### 1.2 Syntax 2


In [90]:
# Selecting a column with dot notation
df.gender

0      female
1      female
2      female
3        male
4        male
        ...  
995    female
996      male
997    female
998    female
999    female
Name: gender, Length: 1000, dtype: object

In [91]:
# Pitfalls of using dot notation
#df.math score # raises a SyntaxError because attribute names cannot contain spaces
df['math score'] # bracket notation works for any column name, including names with spaces

0      72
1      69
2      90
3      47
4      76
       ..
995    88
996    62
997    59
998    68
999    77
Name: math score, Length: 1000, dtype: int64
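
Spaces are not the only pitfall of dot notation. If a column name matches an existing DataFrame attribute or method, the dot form returns that attribute instead of the column. The sketch below assumes a made-up column called `count`, which collides with the `DataFrame.count` method.

```python
import pandas as pd

# "count" is a hypothetical column name that clashes with DataFrame.count().
demo = pd.DataFrame({"count": [10, 20], "math score": [72, 69]})

dot_result = demo.count            # the bound method, not the column
bracket_result = demo["count"]     # the actual column as a Series

print(callable(dot_result))
print(bracket_result.tolist())
```

This is one reason the bracket syntax is the recommended one: it behaves the same for every column name.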

### 2. Selecting two or more columns


In [92]:
# selecting 2 columns using [[]] (a list of column names inside the brackets)
df[['gender','math score']] # gives a dataframe with 2 columns

Unnamed: 0,gender,math score
0,female,72
1,female,69
2,female,90
3,male,47
4,male,76
...,...,...
995,female,88
996,male,62
997,female,59
998,female,68


In [93]:
# checking out the data type of the two columns
df[['gender','math score']].dtypes
# datatype of the selection
type(df[['gender','math score']]) # gives a dataframe

pandas.core.frame.DataFrame
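
The return type depends on the brackets, not on how many columns are selected. As a small sketch on made-up data, a single name in double brackets still yields a one-column DataFrame:

```python
import pandas as pd

# Two invented rows that reuse the column names from the dataset above.
demo = pd.DataFrame({"gender": ["female", "male"], "math score": [72, 47]})

as_series = demo["gender"]      # single brackets -> Series
as_frame = demo[["gender"]]     # double brackets -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```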

### 3. Adding a new Column

#### 1.1 Adding a Column with a scalar value


In [94]:
# adding a new column to dataframe
df['Language Score']= 70
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,72,72,74,70
1,female,group C,some college,standard,completed,69,90,88,70
2,female,group B,master's degree,standard,none,90,95,93,70
3,male,group A,associate's degree,free/reduced,none,47,57,44,70
4,male,group C,some college,standard,none,76,78,75,70


#### 1.2 Adding a new Column with an array


In [95]:
# The dataframe has 1000 rows, so the new column needs an array of 1000 elements
# import numpy 
import numpy as np
# creating an array of 1000 elements
language_score=np.arange(0, 1000) # 0 to 999
df['Language Score']=language_score
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,72,72,74,0
1,female,group C,some college,standard,completed,69,90,88,1
2,female,group B,master's degree,standard,none,90,95,93,2
3,male,group A,associate's degree,free/reduced,none,47,57,44,3
4,male,group C,some college,standard,none,76,78,75,4
...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,995
996,male,group C,high school,free/reduced,none,62,55,55,996
997,female,group C,high school,free/reduced,completed,59,71,65,997
998,female,group D,some college,standard,completed,68,78,77,998


In [96]:
# creating random integers from 50 to 99 (the upper bound 100 is excluded)
language_score=np.random.randint(50,100,size=1000)
# for float values instead, use np.random.uniform (the result is not stored here)
np.random.uniform(50,100,size=1000)
df['Language Score']=language_score
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,72,72,74,56
1,female,group C,some college,standard,completed,69,90,88,67
2,female,group B,master's degree,standard,none,90,95,93,73
3,male,group A,associate's degree,free/reduced,none,47,57,44,78
4,male,group C,some college,standard,none,76,78,75,66
...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,95
996,male,group C,high school,free/reduced,none,62,55,55,61
997,female,group C,high school,free/reduced,completed,59,71,65,98
998,female,group D,some college,standard,completed,68,78,77,72
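
The random scores above change on every run. A minimal sketch of a reproducible version follows. It swaps in NumPy's `default_rng` generator in place of the `np.random.randint` call used above, and the seed value is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for this sketch

# Integers: the upper bound is excluded by default, so values fall in 50..99.
int_scores = rng.integers(50, 100, size=1000)

# Pass endpoint=True to make the upper bound inclusive (50..100).
int_scores_inclusive = rng.integers(50, 100, size=1000, endpoint=True)

# Floats drawn uniformly from the half-open interval [50, 100).
float_scores = rng.uniform(50, 100, size=1000)

print(int_scores.min(), int_scores.max())
print(float_scores.dtype)
```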


### 4. Adding multiple columns with assign() and insert()

#### 1.1 assign()


In [97]:
import numpy as np
score1=np.random.uniform(50,100,size=1000)
score2=np.random.randint(35,80,size=1000)
# creating a series
series1=pd.Series(score1,name='Science Score')
series2=pd.Series(score2,name='History Score')  

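# Using assign() to add multiple columns (returns a new dataframe)
# df=df.assign(Science=series1,History=series2)

# pitfalls: assign() always appends new columns at the end, so a column cannot be placed at a particular position
# also, a column name passed as a keyword argument cannot contain spaces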
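
The keyword-argument restriction on names can be worked around by unpacking a dictionary into `assign()`. The sketch below uses a three-row stand-in for the dataset, with made-up values.

```python
import pandas as pd

# Small stand-in dataframe; the values are invented for illustration.
demo = pd.DataFrame({"math score": [72, 69, 90]})
science = pd.Series([59.0, 97.9, 80.1])
history = pd.Series([75, 63, 79])

# Keyword arguments: the names must be valid Python identifiers (no spaces).
with_kwargs = demo.assign(Science=science, History=history)

# Dictionary unpacking: the keys may contain spaces.
with_spaces = demo.assign(**{"Science Score": science, "History Score": history})

print(list(with_kwargs.columns))
print(list(with_spaces.columns))
print(list(demo.columns))   # the original dataframe is unchanged
```

Either way, the new columns are appended at the end.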


#### 1.2 insert() (recommended)


In [98]:
# insert() adds one column per call at the given position and modifies df in place
df.insert(5,'Science Score',series1)
df.insert(8,'History Score',series2)
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,Science Score,math score,reading score,History Score,writing score,Language Score
0,female,group B,bachelor's degree,standard,none,59.003603,72,72,75,74,56
1,female,group C,some college,standard,completed,97.874311,69,90,63,88,67
2,female,group B,master's degree,standard,none,80.067330,90,95,79,93,73
3,male,group A,associate's degree,free/reduced,none,92.416570,47,57,73,44,78
4,male,group C,some college,standard,none,93.532291,76,78,53,75,66
...,...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,90.298382,88,99,46,95,95
996,male,group C,high school,free/reduced,none,81.126728,62,55,42,55,61
997,female,group C,high school,free/reduced,completed,87.376368,59,71,70,65,98
998,female,group D,some college,standard,completed,85.346046,68,78,54,77,72
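
Because `insert()` modifies the dataframe in place, running the same cell twice fails: by default a column name that already exists is rejected with a `ValueError`. A minimal sketch on a made-up two-row dataframe:

```python
import pandas as pd

# Hypothetical data; only the column positions matter here.
demo = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

demo.insert(1, "b", [3, 4])     # place "b" between "a" and "c"
print(list(demo.columns))

second_insert_failed = False
try:
    demo.insert(1, "b", [3, 4])  # same name again -> rejected by default
except ValueError as err:
    second_insert_failed = True
    print("second insert failed:", err)
```

Re-reading the CSV before inserting is one way to keep such a cell safe to rerun.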


# 2. Math Operation
