# Pandas Learnings

Pandas helps explore, clean and process tabular data. A table in pandas is called a DataFrame. 
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.

Importing Pandas

In [1]:
import pandas as pd

Creating a dataframe and printing it

In [2]:
df=pd.DataFrame({"Name":["Mr. A", "Mr. B", "Mr. C"],"Age":[20,30,40],"Grade":["A","B","F"]})
df

Unnamed: 0,Name,Age,Grade
0,Mr. A,20,A
1,Mr. B,30,B
2,Mr. C,40,F


Each column in a DataFrame is called a Series

In [3]:
df["Name"]

0    Mr. A
1    Mr. B
2    Mr. C
Name: Name, dtype: object

In [4]:
df["Age"]

0    20
1    30
2    40
Name: Age, dtype: int64

In [5]:
df["Grade"]

0    A
1    B
2    F
Name: Grade, dtype: object

We can also create a Series directly in code.
A pandas Series has no column labels because it is a single column, but it does have row labels (its index) and an optional name.

In [6]:
totalmarks=pd.Series([100,80,20],name="Total Marks")
totalmarks

0    100
1     80
2     20
Name: Total Marks, dtype: int64
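To make the point about row labels concrete, here is a minimal sketch. The custom index labels are an assumption added for illustration; the original cell uses the default 0, 1, 2 labels.

```python
import pandas as pd

# Same values as above, but with explicit row labels (hypothetical, for illustration).
total_marks = pd.Series(
    [100, 80, 20],
    index=["Mr. A", "Mr. B", "Mr. C"],
    name="Total Marks",
)

print(total_marks.index.tolist())       # the row labels
print(total_marks.name)                 # the Series name
print(hasattr(total_marks, "columns"))  # False: a Series has no column labels
print(total_marks["Mr. B"])             # look up a value by its row label
```

The row labels behave like keys, so a value can be fetched by label rather than by position.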

Using max() and describe()
The max() method returns the largest value in a column. The describe() method gives a quick statistical overview of the numerical columns; text columns are skipped by default. Called on a DataFrame it returns a DataFrame, and called on a single column it returns a Series.

In [7]:
df["Age"].max()

40

In [8]:
df.describe()

Unnamed: 0,Age
count,3.0
mean,30.0
std,10.0
min,20.0
25%,25.0
50%,30.0
75%,35.0
max,40.0
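The return types of describe() can be checked directly. The sketch below rebuilds the small DataFrame from earlier and also passes include="all", an option not used in the cells above, to summarise the text columns as well.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mr. A", "Mr. B", "Mr. C"],
    "Age": [20, 30, 40],
    "Grade": ["A", "B", "F"],
})

numeric_summary = df.describe()            # DataFrame; only the numeric column Age
age_summary = df["Age"].describe()         # Series; statistics for one column
full_summary = df.describe(include="all")  # also covers the text columns

print(type(numeric_summary).__name__)  # DataFrame
print(type(age_summary).__name__)      # Series
print(list(numeric_summary.columns))   # ['Age']
print(list(full_summary.columns))      # ['Name', 'Age', 'Grade']
```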


### CSV File reading
The read_csv() function reads a CSV file into a DataFrame. Readers for other formats follow the same read_* naming, for example read_excel(), read_sql(), read_json() and read_parquet().

In [9]:
titanic = pd.read_csv("titanic.csv")
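If titanic.csv is not at hand, read_csv() can be tried on in-memory text instead. This is a small sketch: the three rows are copied from the Titanic output in this notebook, trimmed to three columns, and io.StringIO stands in for the file.

```python
import io

import pandas as pd

# A tiny CSV held in memory; the last row has an empty Age field.
csv_text = '''PassengerId,Name,Age
1,"Braund, Mr. Owen Harris",22.0
3,"Heikkinen, Miss. Laina",26.0
889,"Johnston, Miss. Catherine Helen ""Carrie""",
'''

passengers = pd.read_csv(io.StringIO(csv_text))

print(passengers.shape)                # (3, 3)
print(passengers["Age"].isna().sum())  # 1: the empty field was read as NaN
```

Quoted fields may contain commas, and a doubled quote inside a quoted field is read as a literal quote character.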

In [10]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


To see the first N rows of a DataFrame, use the head() method with the required number of rows as argument.
To see the last N rows of a DataFrame, use the tail() method with the required number of rows as argument.

In [11]:
titanic.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
titanic.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
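The behaviour of head() and tail() is easy to verify on a small frame. The DataFrame below is a made-up example, not part of the Titanic data.

```python
import pandas as pd

# Hypothetical ten-row frame, used only to illustrate head() and tail().
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9
default_head = numbers.head()  # with no argument, 5 rows are returned

print(first_three["n"].tolist())  # [0, 1, 2]
print(last_two.index.tolist())    # [8, 9]
print(len(default_head))          # 5
```

Note that tail() keeps the original row labels, which is why the Titanic output above starts at 886 rather than 0.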


To see the data type of each column, use the dtypes attribute (an attribute, so no parentheses).

In [13]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
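The result of dtypes is itself a Series, indexed by column name, so it can be queried like any other Series. A short sketch on the small DataFrame from earlier; the helper pd.api.types.is_numeric_dtype is not used elsewhere in these notes and is included here only to pick out the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mr. A", "Mr. B", "Mr. C"],
    "Age": [20, 30, 40],
    "Grade": ["A", "B", "F"],
})

column_types = df.dtypes  # a Series: column name -> dtype

print(type(column_types).__name__)  # Series
print(column_types["Age"])          # int64

# Collect the names of the numeric columns.
numeric_columns = [
    name
    for name, dtype in column_types.items()
    if pd.api.types.is_numeric_dtype(dtype)
]
print(numeric_columns)  # ['Age']
```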

To store data in a file, use the to_* methods, for example to_excel().

The to_excel() method stores the data as an Excel file. Below, the sheet is named passengers instead of the default Sheet1. Setting index=False keeps the row index labels out of the spreadsheet. Writing .xlsx files requires the optional openpyxl package; if it is not installed, the call fails with the ModuleNotFoundError shown below.

In [14]:
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")

ModuleNotFoundError: No module named 'openpyxl'
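The error above means the optional openpyxl package is missing; installing it (for example with pip install openpyxl) makes the cell work. One possible way to guard against this is sketched below. The fallback to CSV, the temporary directory and the file names are assumptions made for the example, not part of the original notebook.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Name": ["Mr. A", "Mr. B", "Mr. C"],
    "Age": [20, 30, 40],
})

# Check for the Excel engine without importing it.
has_openpyxl = importlib.util.find_spec("openpyxl") is not None

with tempfile.TemporaryDirectory() as tmp_dir:
    if has_openpyxl:
        path = os.path.join(tmp_dir, "people.xlsx")
        df.to_excel(path, sheet_name="people", index=False)
        restored = pd.read_excel(path, sheet_name="people")
    else:
        # Fallback chosen for this sketch: CSV needs no extra package.
        path = os.path.join(tmp_dir, "people.csv")
        df.to_csv(path, index=False)
        restored = pd.read_csv(path)

print(list(restored.columns))    # ['Name', 'Age']
print(restored["Age"].tolist())  # [20, 30, 40]
```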

For a technical summary of a DataFrame (index range, column names, non-null counts, dtypes and memory usage), use the info() method.

In [15]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   Grade   3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
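info() prints its report instead of returning it. When the text or the non-null counts are needed in code, something like the following sketch may help. The missing Age value is made up for the example, and notna().sum() is a separate way of getting the counts, not what info() does internally.

```python
import io

import pandas as pd

# Hypothetical frame with one missing Age, so the non-null counts differ.
df = pd.DataFrame({
    "Name": ["Mr. A", "Mr. B", "Mr. C"],
    "Age": [20.0, None, 40.0],
})

# Capture the printed report in a string buffer.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

# Non-null count per column, as a Series.
non_null_counts = df.notna().sum()

print(report.splitlines()[1])  # RangeIndex: 3 entries, 0 to 2
print(non_null_counts["Age"])  # 2
```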


Subset of a DataFrame
Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series.


In [17]:
ages = titanic["Age"]
ages.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [18]:
type(titanic["Age"])

pandas.core.series.Series

In [19]:
titanic["Age"].shape

(891,)

shape is an attribute of both Series and DataFrame, so it is accessed without parentheses. For a DataFrame it holds the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional, so only the number of rows is returned, as in (891,) above.
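The difference between the two shapes can be seen on the small DataFrame from the start of these notes:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mr. A", "Mr. B", "Mr. C"],
    "Age": [20, 30, 40],
    "Grade": ["A", "B", "F"],
})

print(df.shape)         # (3, 3): rows, columns
print(df["Age"].shape)  # (3,): a Series has rows only

# The tuple can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)   # 3 3
```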

In [20]:
age_sex = titanic[["Age", "Sex"]]
age_sex

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male
...,...,...
886,27.0,male
887,19.0,female
888,,female
889,26.0,male


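To select several columns, pass a list of column names inside the selection brackets. The inner square brackets define a Python list of column names, whereas the outer brackets select the data from the DataFrame, just as in the single-column selection. The returned object is a DataFrame.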
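A quick sketch of how the bracket style changes the result type, using the small DataFrame from earlier. Note that a list with a single column name still gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mr. A", "Mr. B", "Mr. C"],
    "Age": [20, 30, 40],
    "Grade": ["A", "B", "F"],
})

age_series = df["Age"]         # single label: a Series
age_frame = df[["Age"]]        # list with one label: a one-column DataFrame
name_age = df[["Name", "Age"]] # list with two labels: a two-column DataFrame

print(type(age_series).__name__, age_series.shape)  # Series (3,)
print(type(age_frame).__name__, age_frame.shape)    # DataFrame (3, 1)
print(type(name_age).__name__, name_age.shape)      # DataFrame (3, 2)
```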