**Pandas** is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.

**Installation of Pandas**

In [None]:
!pip install pandas



**Import Pandas**

In [None]:
import pandas as pd #importing pandas and providing it with an alias


**Series**

A **Pandas Series** is like a column in a table.

It is a *one-dimensional array* holding data of any type.

In [None]:
a = [1, 7, 2, 4, 9, 8]

myNum = pd.Series(a)

print(myNum)

0    1
1    7
2    2
3    4
4    9
5    8
dtype: int64


In [None]:
print(myNum[0])
print(myNum[5])

1
8


**Labels**

With the *index* argument, you can name your own labels.

In [None]:
a = [10, 17, 21]

myNum = pd.Series(a, index = ["a", "b", "c"])

print(myNum)

a    10
b    17
c    21
dtype: int64


In [None]:
print(myNum["a"])

10
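
Once a Series has custom labels, it helps to be explicit about whether a lookup is by label or by position. The following is a minimal sketch that assumes the same values as the cell above; the names `my_num`, `by_label`, and `by_position` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same labeled Series as in the cell above (variable name is illustrative)
my_num = pd.Series([10, 17, 21], index=["a", "b", "c"])

by_label = my_num.loc["b"]     # look up by index label
by_position = my_num.iloc[1]   # look up by integer position (0-based)

print(by_label, by_position)
```

Both lookups return the same element here, because the label "b" sits at position 1.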


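**Key/Value Objects as Series**

You can also create a Series from a key/value object such as a dictionary. The keys of the dictionary become the labels.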

In [None]:
running = {"day1": 2, "day2": 3, "day3": 5}

myRun = pd.Series(running)

print(myRun)

day1    2
day2    3
day3    5
dtype: int64


**DataFrames**

Data sets in Pandas are usually stored as *two-dimensional* tables called **DataFrames**.

A Series is like a single column; a DataFrame is the whole table.

In [None]:
data = {
  "kilometers": [4, 3, 5],
  "duration": [50, 40, 45]
}

myRun = pd.DataFrame(data)

print(myRun)

   kilometers  duration
0           4        50
1           3        40
2           5        45
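
To make the column/table relationship concrete, the sketch below selects one column from a DataFrame and checks its type. It assumes the same running data as above; the names `my_run` and `col` are illustrative.

```python
import pandas as pd

# Same data as the cell above (variable names are illustrative)
my_run = pd.DataFrame({"kilometers": [4, 3, 5], "duration": [50, 40, 45]})

col = my_run["kilometers"]   # selecting one column by name

print(type(col).__name__)    # the class name of the selected column
print(col.tolist())          # the column's values as a plain list
```

Selecting a single column with one pair of brackets gives back a Series, which is why Series methods also work on DataFrame columns.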


In [None]:
print(myRun.info())       # Overview of the dataset
print(myRun.describe())   # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   kilometers  3 non-null      int64
 1   duration    3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes
None
       kilometers  duration
count         3.0       3.0
mean          4.0      45.0
std           1.0       5.0
min           3.0      40.0
25%           3.5      42.5
50%           4.0      45.0
75%           4.5      47.5
max           5.0      50.0


In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


**Loading Data from a File**

In [None]:
mydf = pd.read_csv('/content/names.csv')

Download CSV - [names.csv](https://github.com/gagan-iitb/DataAnalyticsAndVisualization/blob/main/Lab-W25/dataset/names.csv)

In [None]:
print(mydf.head())  # Display the first 5 rows

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26


In [None]:
print(mydf.head(7))  # Display the first 7 rows

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26
5  William   28
6    Caleb   25


In [None]:
print(mydf['Name'])  # Single column

0      Alice
1        Bob
2    Charlie
3      James
4       John
5    William
6      Caleb
7      Helen
Name: Name, dtype: object


In [None]:
print(mydf[['Age', 'Name']])  # Multiple columns

   Age     Name
0   25    Alice
1   30      Bob
2   35  Charlie
3   23    James
4   26     John
5   28  William
6   25    Caleb
7   30    Helen


**Filtering Rows**

In [None]:
print(mydf[mydf['Age'] > 25])

      Name  Age
1      Bob   30
2  Charlie   35
4     John   26
5  William   28
7    Helen   30


**Adding/Updating Columns**

In [None]:
mydf['Salary'] = [50000, 60000, 50000, 50000, 30000, 70000, 90000, 80000]
print(mydf)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   50000
3    James   23   50000
4     John   26   30000
5  William   28   70000
6    Caleb   25   90000
7    Helen   30   80000


**Saving to a File**

In [None]:
mydf.to_csv('myDataframe.csv', index=False)
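
The `index=False` argument keeps the row labels out of the file. The sketch below shows the difference without touching the disk: it relies on `to_csv` returning the CSV text as a string when no path is given, and on `io.StringIO` to read that text back. The small DataFrame and the variable names are assumptions made for illustration.

```python
import io
import pandas as pd

# Small illustrative DataFrame (not the names.csv data)
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path argument, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)      # first column holds the row labels
print(without_index)   # only the Name and Age columns

# Reading the index-free text back gives the original columns
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

If the index is written and the file is later read with a plain `read_csv`, the row labels come back as an extra unnamed column, which is usually not wanted.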

**Dropping Columns**

In [None]:
mydf = mydf.drop('Salary', axis=1)  # Drop column

In [None]:
print(mydf)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    James   23
4     John   26
5  William   28
6    Caleb   25
7    Helen   30


In [None]:
#Add two new columns named Marks and Department to mydf and display it

In [None]:
#Save the newly created mydf to a CSV file. (Name of file = myDataframe_YourIDNumber.csv)

In [None]:
#Filter all the rows where Age is between 25 and 30.

**unique() Function**

In [None]:
mydf.Age.unique()

array([25, 30, 35, 23, 26, 28])
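
`unique()` lists the distinct values but not how many there are or how often each occurs. As a hedged sketch, the two related Series methods `nunique()` and `value_counts()` fill that gap. The `ages` Series below is built by hand from the Age values shown earlier, so it is an assumption rather than the loaded file.

```python
import pandas as pd

# Age values copied from the names.csv output shown earlier
ages = pd.Series([25, 30, 35, 23, 26, 28, 25, 30], name="Age")

print(ages.nunique())        # number of distinct values
print(ages.value_counts())   # how many times each value occurs
```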

**Sorting**

In [None]:
mydf.sort_values(by=['Age'])

      Name  Age
3    James   23
0    Alice   25
6    Caleb   25
4     John   26
5  William   28
1      Bob   30
7    Helen   30
2  Charlie   35
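
Note that `sort_values` returns a new DataFrame and leaves the original untouched, and that `ascending=False` reverses the order. A minimal sketch, assuming a hand-made four-row DataFrame (the names `df` and `oldest_first` are illustrative):

```python
import pandas as pd

# Small illustrative DataFrame (a subset of the values used above)
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "James"],
                   "Age": [25, 30, 35, 23]})

# Sort from largest Age to smallest; the result is a new DataFrame
oldest_first = df.sort_values(by="Age", ascending=False)

print(oldest_first)
print(df)   # the original row order is unchanged
```

To keep the sorted order, assign the result back to a variable, as done here with `oldest_first`.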


In [None]:
#Sort the mydf DataFrame by Name, then by Marks.

**Missing Data**

In [None]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
    "Gender": ["Female", "Male", None, "Female", None, "Male", "Female", None],
}
df = pd.DataFrame(data)
print(df)

      Name  Gender
0    Alice  Female
1      Bob    Male
2  Charlie    None
3    Diana  Female
4      Eve    None
5    Frank    Male
6    Grace  Female
7     Hank    None


In [None]:
print("\nCheck for missing values:")
print(pd.isnull(df))


Check for missing values:
    Name  Gender
0  False   False
1  False   False
2  False    True
3  False   False
4  False    True
5  False   False
6  False   False
7  False    True


In [None]:
print("\nCheck for missing values(Column):")
print(pd.isnull(df['Gender']))


Check for missing values(Column):
0    False
1    False
2     True
3    False
4     True
5    False
6    False
7     True
Name: Gender, dtype: bool


In [None]:
# Fill missing values in the 'Gender' column with a default value
df['Gender'] = df['Gender'].fillna("Not Specified")

In [None]:
#updated dataframe
print(df)

      Name         Gender
0    Alice         Female
1      Bob           Male
2  Charlie  Not Specified
3    Diana         Female
4      Eve  Not Specified
5    Frank           Male
6    Grace         Female
7     Hank  Not Specified


In [None]:
#Read myStudentDataFrame.csv

In [None]:
#Check for missing data in all columns using appropriate pandas functions.

In [None]:
#Drop Rows with Missing Data

In [None]:
#Compute Summary Statistics (MEAN, MAX, MIN)

In [None]:
#Filter Data and Compute Pass/Fail
#mark >= 40: Pass
#mark < 40: Fail

In [None]:
#Add a new column Result to the DataFrame indicating Pass or Fail.

In [None]:
#Save the Final DataFrame
#Save the updated DataFrame (with the Result column) to a new CSV file named Result_YourIDNumber.csv.

Additional Practice Questions - [Click Here](https://colab.research.google.com/drive/1_Hc9yV2RIgvau6BsLiYXn87orXqa52RX?usp=sharing)