**Pandas** is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.

**Installation of Pandas**

In [4]:
!pip install pandas



**Import Pandas**

In [5]:
import pandas as pd  # import pandas under the conventional alias "pd"


**Series**

A **Pandas Series** is like a column in a table.

It is a *one-dimensional array* holding data of any type.

In [6]:
a = [1, 7, 2, 4, 9, 8]

myNum = pd.Series(a)

print(myNum)

0    1
1    7
2    2
3    4
4    9
5    8
dtype: int64


In [7]:
print(myNum[0])
print(myNum[5])

1
8
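The bracket lookups above work because the default labels are the integers 0 to 5, so position and label coincide. A minimal sketch (the variable names are illustrative) that keeps the two meanings explicit, using `.iloc` for positions and `.loc` for labels:

```python
import pandas as pd

a = [1, 7, 2, 4, 9, 8]
my_num = pd.Series(a)

# .iloc is always position-based; .loc is always label-based.
first_by_position = my_num.iloc[0]
last_by_position = my_num.iloc[-1]
by_label = my_num.loc[5]

print(first_by_position, last_by_position, by_label)
```

Once custom labels are assigned, `.loc` and `.iloc` make it clear which kind of lookup is intended.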


**Labels**

With the *index* argument, you can name your own labels.

In [8]:
a = [10, 17, 21]

myNum = pd.Series(a, index = ["a", "b", "c"])

print(myNum)

a    10
b    17
c    21
dtype: int64


In [9]:
print(myNum["a"])

10


**Key/Value Objects as Series**

In [10]:
running = {"day1": 2, "day2": 3, "day3": 5}

myRun = pd.Series(running)

print(myRun)

day1    2
day2    3
day3    5
dtype: int64
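When a dictionary is passed, its keys become the labels. As a small sketch, the `index` argument can also be combined with a dictionary to keep only some of the keys (the name `my_run_subset` is illustrative):

```python
import pandas as pd

running = {"day1": 2, "day2": 3, "day3": 5}

# Only the keys listed in `index` are kept, in the order given.
my_run_subset = pd.Series(running, index=["day1", "day3"])

print(my_run_subset)
```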


**DataFrames**

Data sets in Pandas are usually stored as *two-dimensional* tables, called **DataFrames**.

A Series is like a single column; a DataFrame is the whole table.

In [11]:
data = {
  "kilometers": [4, 3, 5],
  "duration": [50, 40, 45]
}

myRun = pd.DataFrame(data)

print(myRun)

   kilometers  duration
0           4        50
1           3        40
2           5        45
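Since a DataFrame is a table, both rows and columns can be pulled out of it. The sketch below reuses the running data and assumes hypothetical row labels (`day1` to `day3`) passed through the `index` argument, purely for illustration:

```python
import pandas as pd

data = {"kilometers": [4, 3, 5], "duration": [50, 40, 45]}

# The row labels here are an assumption added for this example.
my_run = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = my_run.loc["day2"]             # one row, returned as a Series
rows = my_run.loc[["day1", "day3"]]  # several rows, returned as a DataFrame
col = my_run["kilometers"]           # one column, returned as a Series

print(row)
print(rows)
print(col)
```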


In [12]:
print(myRun.info())       # Overview of the dataset
print(myRun.describe())   # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   kilometers  3 non-null      int64
 1   duration    3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes
None
       kilometers  duration
count         3.0       3.0
mean          4.0      45.0
std           1.0       5.0
min           3.0      40.0
25%           3.5      42.5
50%           4.0      45.0
75%           4.5      47.5
max           5.0      50.0


In [13]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [15]:
from google.colab import drive

In [17]:
drive.mount("/content/drive")

Mounted at /content/drive


**Loading Data from a File**

In [18]:
mydf = pd.read_csv("/content/drive/MyDrive/Course Work/Sem 4/Data Analysis and Visualization/Lab 3/names.csv")

Download CSV - [names.csv](https://github.com/gagan-iitb/DataAnalyticsAndVisualization/blob/main/Lab-W25/dataset/names.csv)

In [19]:
print(mydf.head())  # Display the first 5 rows

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26


In [20]:
print(mydf.head(7))  # Display the first 7 rows

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26
5  William   28
6    Caleb   25
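If the Drive path is not available, the same steps can be tried without any file. This is a sketch that feeds `pd.read_csv` an in-memory text buffer; the rows are the ones printed in this section, and the names `csv_text` and `names_df` are illustrative:

```python
import io
import pandas as pd

# CSV text with the same columns and rows shown in the outputs of this section.
csv_text = """Name,Age
Alice,25
Bob,30
Charlie,35
James,23
John,26
William,28
Caleb,25
Helen,30
"""

# read_csv accepts a file path or any file-like object such as StringIO.
names_df = pd.read_csv(io.StringIO(csv_text))

print(names_df.head(3))  # first 3 rows
print(names_df.tail(2))  # last 2 rows
```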


In [21]:
print(mydf['Name'])  # Single column

0      Alice
1        Bob
2    Charlie
3      James
4       John
5    William
6      Caleb
7      Helen
Name: Name, dtype: object


In [22]:
print(mydf[['Age', 'Name']])  # Multiple columns

   Age     Name
0   25    Alice
1   30      Bob
2   35  Charlie
3   23    James
4   26     John
5   28  William
6   25    Caleb
7   30    Helen


**Filtering Rows**

In [23]:
print(mydf[mydf['Age'] > 25])

      Name  Age
1      Bob   30
2  Charlie   35
4     John   26
5  William   28
7    Helen   30
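The filter above works in two steps: the comparison builds a boolean Series, and indexing with that Series keeps the rows marked `True`. A sketch that spells this out and also shows `isin()` for matching against a list of values (the DataFrame name `people` is illustrative):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "James", "John", "William", "Caleb", "Helen"],
    "Age": [25, 30, 35, 23, 26, 28, 25, 30],
})

# The comparison produces one True/False value per row.
mask = people["Age"] > 25
older = people[mask]

# isin() keeps rows whose value appears in the given list.
selected = people[people["Name"].isin(["Alice", "Helen"])]

print(mask.tolist())
print(older)
print(selected)
```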


**Adding/Updating Columns**

In [24]:
mydf['Salary'] = [50000, 60000, 50000, 50000, 30000, 70000, 90000, 80000]
print(mydf)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   50000
3    James   23   50000
4     John   26   30000
5  William   28   70000
6    Caleb   25   90000
7    Helen   30   80000
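A list assigned as a column must have one value per row. Columns can also be computed from existing ones, and assigning to an existing name overwrites it. The sketch below uses a shortened three-row table; the bonus and raise figures are hypothetical values chosen only for illustration:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [50000, 60000, 50000],
})

# Assigning to a new name adds a column computed from an existing one.
staff["Bonus"] = staff["Salary"] // 10    # hypothetical bonus: one tenth of salary

# Assigning to an existing name updates that column in place.
staff["Salary"] = staff["Salary"] + 1000  # hypothetical raise

print(staff)
```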


**Saving to a File**

In [25]:
mydf.to_csv('myDataframe.csv', index=False)
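The `index=False` argument matters when the file is read back later. As a sketch, calling `to_csv()` without a path returns the CSV text, which makes the difference easy to inspect (the small DataFrame here is illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path argument, to_csv() returns the CSV text instead of writing a file.
without_index = df.to_csv(index=False)
with_index = df.to_csv()  # index=True is the default

print(without_index.splitlines()[0])  # header row: Name,Age
print(with_index.splitlines()[0])     # header row: ,Name,Age

# Reading the second form back produces an extra column for the saved index.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

Saving with `index=False` avoids that extra `Unnamed: 0` column on reload.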

**Dropping Columns**

In [26]:
mydf = mydf.drop('Salary', axis=1)  # Drop column

In [27]:
print(mydf)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26
5  William   28
6    Caleb   25
7    Helen   30
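Note that `drop` returns a new DataFrame, which is why the result is assigned back to `mydf` above. A short sketch showing this, using the equivalent `columns=` keyword in place of `axis=1` (the names `people` and `trimmed` are illustrative):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 50000],
})

# drop() leaves the original untouched and returns a copy without the column.
trimmed = people.drop(columns=["Salary"])

print(list(people.columns))
print(list(trimmed.columns))
```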


In [28]:
#Create/Append two new columns named Marks, Department in mydf and display it

mydf['Marks'] = [85, 92, 78, 88, 95, 75, 80, 90]
mydf['Department'] = ['CSE', 'ECE', 'MECH', 'CSE', 'ECE', 'MECH', 'CSE', 'ECE']
mydf

      Name  Age  Marks Department
0    Alice   25     85        CSE
1      Bob   30     92        ECE
2  Charlie   35     78       MECH
3    James   23     88        CSE
4     John   26     95        ECE
5  William   28     75       MECH
6    Caleb   25     80        CSE
7    Helen   30     90        ECE


In [29]:
#Save the newly created mydf to a csv file. (Name of file = myDataframe_YourIDNumber.csv)

In [30]:

YourIDNumber = "YourIDNumber"

mydf.to_csv(f'myDataframe_{YourIDNumber}.csv', index=False)

In [31]:
#Filter all the rows where Age falls between 25-30.

In [32]:

filtered_df = mydf[(mydf['Age'] >= 25) & (mydf['Age'] <= 30)]
filtered_df

      Name  Age  Marks Department
0    Alice   25     85        CSE
1      Bob   30     92        ECE
4     John   26     95        ECE
5  William   28     75       MECH
6    Caleb   25     80        CSE
7    Helen   30     90        ECE


**unique() Function**

In [33]:
mydf.Age.unique()

array([25, 30, 35, 23, 26, 28])
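`unique()` lists each distinct value once, in order of first appearance. Two related methods, sketched below on the same ages, report how many distinct values there are and how often each occurs (the variable names are illustrative):

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 23, 26, 28, 25, 30], name="Age")

distinct = ages.unique()      # distinct values, in order of first appearance
n_distinct = ages.nunique()   # number of distinct values
counts = ages.value_counts()  # occurrences of each value

print(distinct)
print(n_distinct)
print(counts)
```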

**Sorting**

In [34]:
mydf.sort_values(by=['Age'])

      Name  Age  Marks Department
3    James   23     88        CSE
0    Alice   25     85        CSE
6    Caleb   25     80        CSE
4     John   26     95        ECE
5  William   28     75       MECH
1      Bob   30     92        ECE
7    Helen   30     90        ECE
2  Charlie   35     78       MECH
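`sort_values` sorts in ascending order by default and returns a new DataFrame rather than changing the original. A sketch of a descending sort with `ascending=False` (the names `people` and `oldest_first` are illustrative):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "James", "John", "William", "Caleb", "Helen"],
    "Age": [25, 30, 35, 23, 26, 28, 25, 30],
})

# ascending=False reverses the order; the original DataFrame is not modified.
oldest_first = people.sort_values(by="Age", ascending=False)

print(oldest_first)
```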


In [35]:
#Sort mydf dataframe on the basis of Name,Marks.

In [36]:


mydf.sort_values(by=['Name', 'Marks'])


      Name  Age  Marks Department
0    Alice   25     85        CSE
1      Bob   30     92        ECE
6    Caleb   25     80        CSE
2  Charlie   35     78       MECH
7    Helen   30     90        ECE
3    James   23     88        CSE
4     John   26     95        ECE
5  William   28     75       MECH
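Because every name in this data set is different, the second key (`Marks`) never gets to break a tie. The second key becomes visible when the first one repeats, as in this sketch that sorts by `Department` and then by `Marks` from highest to lowest (the name `students` is illustrative):

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "James", "John", "William", "Caleb", "Helen"],
    "Marks": [85, 92, 78, 88, 95, 75, 80, 90],
    "Department": ["CSE", "ECE", "MECH", "CSE", "ECE", "MECH", "CSE", "ECE"],
})

# One ascending flag per key: departments A to Z, marks high to low within each.
by_dept = students.sort_values(by=["Department", "Marks"], ascending=[True, False])

print(by_dept)
```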


**Missing Data**

In [37]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
    "Gender": ["Female", "Male", None, "Female", None, "Male", "Female", None],
}
df = pd.DataFrame(data)
print(df)

      Name  Gender
0    Alice  Female
1      Bob    Male
2  Charlie    None
3    Diana  Female
4      Eve    None
5    Frank    Male
6    Grace  Female
7     Hank    None


In [38]:
print("\nCheck for missing values:")
print(pd.isnull(df))


Check for missing values:
    Name  Gender
0  False   False
1  False   False
2  False    True
3  False   False
4  False    True
5  False   False
6  False   False
7  False    True


In [39]:
print("\nCheck for missing values(Column):")
print(pd.isnull(df['Gender']))


Check for missing values(Column):
0    False
1    False
2     True
3    False
4     True
5    False
6    False
7     True
Name: Gender, dtype: bool


In [40]:
# Recreate 'df' so that this cell starts from the original data with missing values.
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
    "Gender": ["Female", "Male", None, "Female", None, "Male", "Female", None],
}
df = pd.DataFrame(data)  # Recreate the DataFrame

# Fill missing values in the 'Gender' column with a default value
df['Gender'] = df['Gender'].fillna("Not Specified")
print(df)

      Name         Gender
0    Alice         Female
1      Bob           Male
2  Charlie  Not Specified
3    Diana         Female
4      Eve  Not Specified
5    Frank           Male
6    Grace         Female
7     Hank  Not Specified


In [41]:
#updated dataframe
print(df)

      Name         Gender
0    Alice         Female
1      Bob           Male
2  Charlie  Not Specified
3    Diana         Female
4      Eve  Not Specified
5    Frank           Male
6    Grace         Female
7     Hank  Not Specified


In [42]:
#Read the saved CSV file (myDataframe_YourIDNumber.csv)

In [45]:
import pandas as pd

YourIDNumber = "YourIDNumber"
file_name = f'myDataframe_{YourIDNumber}.csv'

try:
    my_student_df = pd.read_csv(file_name)
    print(my_student_df.head())
except FileNotFoundError:
    print(f"Error: {file_name} not found. Please ensure the file exists in the correct location.")
except pd.errors.ParserError:
    print(f"Error: Could not parse {file_name}. Please check the file format.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

      Name  Age  Marks Department
0    Alice   25     85        CSE
1      Bob   30     92        ECE
2  Charlie   35     78       MECH
3    James   23     88        CSE
4     John   26     95        ECE


In [44]:
#Check for missing data in all columns using appropriate pandas functions.

In [46]:

import pandas as pd
YourIDNumber = "YourIDNumber"
file_name = f'myDataframe_{YourIDNumber}.csv'
try:
    my_student_df = pd.read_csv(file_name)

    missing_data_summary = my_student_df.isnull().sum()
    print("Missing data summary:\n", missing_data_summary)

except FileNotFoundError:
    print(f"Error: {file_name} not found. Please ensure the file exists in the correct location.")
except pd.errors.ParserError:
    print(f"Error: Could not parse {file_name}. Please check the file format.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Missing data summary:
 Name          0
Age           0
Marks         0
Department    0
dtype: int64


In [47]:
#Drop Rows with Missing Data

In [48]:

import pandas as pd
YourIDNumber = "YourIDNumber"
file_name = f'myDataframe_{YourIDNumber}.csv'
try:
    my_student_df = pd.read_csv(file_name)
    df_dropped = my_student_df.dropna()
    print(df_dropped)

except FileNotFoundError:
    print(f"Error: {file_name} not found. Please ensure the file exists in the correct location.")
except pd.errors.ParserError:
    print(f"Error: Could not parse {file_name}. Please check the file format.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

      Name  Age  Marks Department
0    Alice   25     85        CSE
1      Bob   30     92        ECE
2  Charlie   35     78       MECH
3    James   23     88        CSE
4     John   26     95        ECE
5  William   28     75       MECH
6    Caleb   25     80        CSE
7    Helen   30     90        ECE
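The saved file has no missing values, so `dropna()` returns all eight rows unchanged. To see it remove something, here is a sketch that applies it to the Name/Gender data from the Missing Data section; the `subset` argument limits the check to the listed columns (the variable names are illustrative):

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
    "Gender": ["Female", "Male", None, "Female", None, "Male", "Female", None],
}
gender_df = pd.DataFrame(data)

# Drop every row that has a missing value in any column.
complete_rows = gender_df.dropna()

# Only consider the 'Gender' column when deciding which rows to drop.
complete_gender = gender_df.dropna(subset=["Gender"])

print(complete_rows)
print(len(gender_df), len(complete_rows))
```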


In [49]:
#Compute Summary Statistics (MEAN,MAX,MIN)

In [50]:

import pandas as pd
YourIDNumber = "YourIDNumber"
file_name = f'myDataframe_{YourIDNumber}.csv'
try:
    my_student_df = pd.read_csv(file_name)
    numerical_stats = my_student_df.describe()
    print(numerical_stats)

except FileNotFoundError:
    print(f"Error: {file_name} not found.")
except pd.errors.ParserError:
    print(f"Error: Could not parse {file_name}.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

            Age      Marks
count   8.00000   8.000000
mean   27.75000  85.375000
std     3.84522   7.130167
min    23.00000  75.000000
25%    25.00000  79.500000
50%    27.00000  86.500000
75%    30.00000  90.500000
max    35.00000  95.000000
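`describe()` reports more than the statistics asked for. If only the mean, maximum, and minimum are wanted, one option is `agg()` with a list of function names, sketched here on the same Age and Marks values (the names `students` and `stats` are illustrative):

```python
import pandas as pd

students = pd.DataFrame({
    "Age": [25, 30, 35, 23, 26, 28, 25, 30],
    "Marks": [85, 92, 78, 88, 95, 75, 80, 90],
})

# agg() returns one row per requested statistic and one column per input column.
stats = students.agg(["mean", "max", "min"])

print(stats)
print(students["Marks"].mean())  # a single statistic for a single column
```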


In [51]:
#Filter Data and Compute Pass/Fail
#mark >= 40: Pass
#mark < 40: Fail

In [52]:

import pandas as pd
YourIDNumber = "YourIDNumber"
file_name = f'myDataframe_{YourIDNumber}.csv'
try:
    my_student_df = pd.read_csv(file_name)
    my_student_df['Result'] = my_student_df['Marks'].apply(lambda mark: 'Pass' if mark >= 40 else 'Fail')
    print(my_student_df)
except FileNotFoundError:
    print(f"Error: {file_name} not found. Please ensure the file exists in the correct location.")
except pd.errors.ParserError:
    print(f"Error: Could not parse {file_name}. Please check the file format.")
except KeyError:
    print(f"Error: 'Marks' column not found in {file_name}.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

      Name  Age  Marks Department Result
0    Alice   25     85        CSE   Pass
1      Bob   30     92        ECE   Pass
2  Charlie   35     78       MECH   Pass
3    James   23     88        CSE   Pass
4     John   26     95        ECE   Pass
5  William   28     75       MECH   Pass
6    Caleb   25     80        CSE   Pass
7    Helen   30     90        ECE   Pass
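Every student in the file scores at least 40, so the `Fail` branch is never exercised. The sketch below adds one hypothetical failing student (not part of the original data) and uses `numpy.where` as a vectorized alternative to `apply` with a lambda:

```python
import numpy as np
import pandas as pd

marks_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Sam"],  # "Sam" is a hypothetical student
    "Marks": [85, 92, 35],            # 35 is a made-up mark to show a Fail
})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once.
marks_df["Result"] = np.where(marks_df["Marks"] >= 40, "Pass", "Fail")

print(marks_df)
```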


In [53]:
#Add a new column Result to the DataFrame indicating Pass or Fail.

In [54]:

YourIDNumber = "YourIDNumber"
file_name = f'myDataframe_{YourIDNumber}.csv'

try:
    my_student_df = pd.read_csv(file_name)
    my_student_df['Result'] = my_student_df['Marks'].apply(lambda mark: 'Pass' if mark >= 40 else 'Fail')
    print(my_student_df)
except FileNotFoundError:
    print(f"Error: {file_name} not found. Please ensure the file exists in the correct location.")
except pd.errors.ParserError:
    print(f"Error: Could not parse {file_name}. Please check the file format.")
except KeyError:
    print(f"Error: 'Marks' column not found in {file_name}.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

      Name  Age  Marks Department Result
0    Alice   25     85        CSE   Pass
1      Bob   30     92        ECE   Pass
2  Charlie   35     78       MECH   Pass
3    James   23     88        CSE   Pass
4     John   26     95        ECE   Pass
5  William   28     75       MECH   Pass
6    Caleb   25     80        CSE   Pass
7    Helen   30     90        ECE   Pass


In [55]:
#Save the Final DataFrame
#Save the updated DataFrame (with the Result column) to a new CSV file named Result_YourIDNumber.csv.

In [56]:

YourIDNumber = "YourIDNumber"
file_name = f'myDataframe_{YourIDNumber}.csv'

try:
    my_student_df = pd.read_csv(file_name)
    my_student_df['Result'] = my_student_df['Marks'].apply(lambda mark: 'Pass' if mark >= 40 else 'Fail')
    my_student_df.to_csv(f'Result_{YourIDNumber}.csv', index=False)
    print(f"DataFrame saved to Result_{YourIDNumber}.csv")
except FileNotFoundError:
    print(f"Error: {file_name} not found. Please ensure the file exists in the correct location.")
except pd.errors.ParserError:
    print(f"Error: Could not parse {file_name}. Please check the file format.")
except KeyError:
    print(f"Error: 'Marks' column not found in {file_name}.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

DataFrame saved to Result_YourIDNumber.csv


Additional Practice Questions - [Click Here](https://colab.research.google.com/drive/1_Hc9yV2RIgvau6BsLiYXn87orXqa52RX?usp=sharing)