Pandas in Python Tutorial

What is Pandas?

In [1]:
# Open-source library
# Designed for structured (tabular) data
# Used to clean, analyze, and visualize data effectively
# Two main data structures: 1. Series 2. DataFrame

Installing Pandas

In [1]:
!pip install pandas



Introducing the Pandas Series

"A series is like a column in a spreadsheet or a 1D array in NumPy."

In [3]:
import pandas as pd

# creating a series
data = pd.Series([10,20,30,40])
print(data)

# accessing an element by its index label (3 is the last label in the default 0..3 index)
print(data[3])

0    10
1    20
2    30
3    40
dtype: int64
40
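The lookup above works because the default index labels are the integers 0 to 3. As a minimal sketch of how label-based and position-based access differ, the example below gives the same values a custom index. The labels "a" to "d" and the variable names are assumptions made up for illustration.

```python
import pandas as pd

# Same values as above, but with hypothetical string labels instead of 0..3
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# .loc looks up by index label
by_label = scores.loc["d"]

# .iloc looks up by integer position (0-based)
by_position = scores.iloc[3]

print(by_label, by_position)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a number is meant as a label or as a position.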


Introducing the Pandas DataFrame

"It's a 2D table with rows and columns, just like an Excel sheet."

In [28]:
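# creating a DataFrame from a dictionary of columns
data = {'Name':["Jack","Peter Son","Bob"],'Age':[20,30,35],'Score':[85,78,65]}
df = pd.DataFrame(data)
print(df)

# saving to CSV (the row index is written as the first column unless index=False is passed)
df.to_csv('/Users/mac/output_data1.csv')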

        Name  Age  Score
0       Jack   20     85
1  Peter Son   30     78
2        Bob   35     65


Exploring the DataFrame

In [9]:
# viewing the first few rows
print(df.head()) # head() returns the first 5 rows by default

# getting the column names
print(df.columns)

# getting the data types
print(df.dtypes)

# counting the missing values in each column
print(df.isnull().sum())

        Name  Age  Score
0       Jack   20     85
1  Peter Son   30     78
2        Bob   35     65
Index(['Name', 'Age', 'Score'], dtype='object')
Name     object
Age       int64
Score     int64
dtype: object
Name     0
Age      0
Score    0
dtype: int64
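Two more inspection calls are often useful alongside the ones above: `shape` for the table's dimensions and `describe()` for summary statistics. The sketch below rebuilds the same three-row table under the assumed name `df_demo` so that it runs on its own.

```python
import pandas as pd

# Rebuild the tutorial's sample table (df_demo is a name chosen for this sketch)
df_demo = pd.DataFrame({
    "Name": ["Jack", "Peter Son", "Bob"],
    "Age": [20, 30, 35],
    "Score": [85, 78, 65],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max)
summary = df_demo.describe()
print(summary)
```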


Data Selection and Indexing

In [14]:
# selecting a single column (returns a Series)
print(df["Name"])

# selecting multiple columns (returns a DataFrame)
print(df[["Name","Score"]])

# filtering rows with a boolean condition
filtered_df = df[df['Age']==20]
print(filtered_df)

0         Jack
1    Peter Son
2          Bob
Name: Name, dtype: object
        Name  Score
0       Jack     85
1  Peter Son     78
2        Bob     65
   Name  Age  Score
0  Jack   20     85
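Row filters can also be combined, and rows and columns can be selected in one step with `.loc`. The following is a minimal sketch on a rebuilt copy of the sample table. The name `df_demo` and the thresholds 30 and 70 are assumptions picked for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Jack", "Peter Son", "Bob"],
    "Age": [20, 30, 35],
    "Score": [85, 78, 65],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (df_demo["Age"] >= 30) & (df_demo["Score"] > 70)

# .loc[rows, columns] filters rows and picks columns at the same time
selected = df_demo.loc[mask, ["Name", "Score"]]
print(selected)

# .iloc selects by integer position; here, the first row
first_row = df_demo.iloc[0]
print(first_row["Name"])
```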


Modifying data in DataFrames

In [17]:
#adding a new column
df["Pass"] = df['Score'] > 60
print(df)

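#modify the existing column
# (each run of this cell adds 1 again; the output below comes from a repeated run, so Age starts at 21)
df['Age'] = df["Age"] + 1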
print(df)

#delete/drop a column
df = df.drop("Pass", axis=1)
print(df)

        Name  Age  Score  Pass
0       Jack   21     85  True
1  Peter Son   31     78  True
2        Bob   36     65  True
        Name  Age  Score  Pass
0       Jack   22     85  True
1  Peter Son   32     78  True
2        Bob   37     65  True
        Name  Age  Score
0       Jack   22     85
1  Peter Son   32     78
2        Bob   37     65
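The cell above changes `df` in place, so running it again changes the result. As an alternative sketch, `assign` and `drop(columns=...)` return new DataFrames and leave the original untouched. The names `df_demo`, `with_pass`, and `without_pass` are assumptions for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Jack", "Peter Son", "Bob"],
    "Age": [20, 30, 35],
    "Score": [85, 78, 65],
})

# assign returns a new DataFrame with the extra column; df_demo is unchanged
with_pass = df_demo.assign(Pass=df_demo["Score"] > 60)

# drop(columns=...) is equivalent to drop(..., axis=1)
without_pass = with_pass.drop(columns="Pass")

print(with_pass)
print(without_pass)
```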


Handling Missing Data

In [19]:
# filling missing values in Age with 0
df['Age'] = df['Age'].fillna(0)
print(df)

# dropping rows with missing values
df = df.dropna()


        Name  Age  Score
0       Jack   22     85
1  Peter Son   32     78
2        Bob   37     65
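The sample table has no missing values, so `fillna` and `dropna` leave it unchanged in the output above. The sketch below uses a hypothetical table, `df_missing`, with two gaps added on purpose to show what each call does.

```python
import pandas as pd

# Hypothetical data: None in a numeric column is stored as NaN
df_missing = pd.DataFrame({
    "Name": ["Jack", "Peter Son", "Bob"],
    "Age": [20, None, 35],
    "Score": [85, 78, None],
})

# Count the missing values per column
missing_counts = df_missing.isnull().sum()
print(missing_counts)

# Fill only the Age column with 0; Score keeps its NaN
filled = df_missing.fillna({"Age": 0})
print(filled)

# Drop every row that contains at least one missing value
dropped = df_missing.dropna()
print(dropped)
```

Both `fillna` and `dropna` return new objects, so `df_missing` itself still contains the gaps afterwards.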


Grouping and Aggregating Data

In [20]:
# grouping by Age and averaging Score (every age is unique here, so each group has one row)
grouped = df.groupby('Age')['Score'].mean()
print(grouped)

Age
22    85.0
32    78.0
37    65.0
Name: Score, dtype: float64
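Because every age in the sample is unique, each group above contains a single row and the mean equals the original score. Grouping is more informative when a key repeats. The sketch below uses a hypothetical `Team` column and made-up scores to show this.

```python
import pandas as pd

# Hypothetical data with a repeating group key
df_teams = pd.DataFrame({
    "Team": ["A", "A", "B", "B"],
    "Score": [80, 90, 60, 70],
})

# One mean per team
mean_by_team = df_teams.groupby("Team")["Score"].mean()
print(mean_by_team)

# agg computes several statistics per group in one call
stats_by_team = df_teams.groupby("Team")["Score"].agg(["mean", "max", "count"])
print(stats_by_team)
```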


Importing and Exporting Data

In [27]:
# reading a CSV file into a DataFrame
df = pd.read_csv('/Users/mac/Salary_Data.csv')
print(df.head())

# writing the DataFrame to a new CSV file without the index column
df.to_csv('/Users/mac/output_data.csv', index=False)

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
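The paths above exist only on the author's machine. As a self-contained sketch of the same write-then-read round trip, the example below uses a temporary directory. The file name `demo.csv` and the variable names are assumptions, and the two rows are copied from the output above.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the printed output above
df_out = pd.DataFrame({
    "YearsExperience": [1.1, 1.3],
    "Salary": [39343.0, 46205.0],
})

# The temporary directory is deleted automatically when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")  # hypothetical file name

    # index=False keeps the row index out of the file
    df_out.to_csv(csv_path, index=False)

    # Read the file back into a new DataFrame
    df_in = pd.read_csv(csv_path)

print(df_in)
```

Without `index=False`, the file would gain an extra unnamed first column holding the row index, which would then show up as a column when the file is read back.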


Question 1: What kind of data does Pandas handle?

Question 2: How do I read and write tabular data?

Question 3: How do I select a subset of a table?

Question 4: How to create plots in Pandas?

Question 5: How to create new columns derived from existing columns?

Question 6: How to calculate summary statistics?

Question 7: How to reshape the layout of tables?

Question 8: How to combine data from multiple tables?

Question 9: How to handle time series data?

Question 10: How to manipulate textual data?