Pandas – Data Analysis Library
Why Learn Pandas?
What is it?
Pandas is a Python library for data manipulation and analysis. It provides two core data structures: the Series (one-dimensional, labeled) and the DataFrame (two-dimensional, tabular).

Why do we need it?
1. Simplifies data cleaning, exploration, and manipulation.
2. Handles in-memory datasets efficiently through vectorized, column-wise operations.
3. Built on top of NumPy for fast computations.
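
As a rough illustration of the NumPy point above, the sketch below applies arithmetic to a whole Series at once instead of looping in Python. The variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

# Example values only; variable names are illustrative
runs = pd.Series([10, 20, 30, 40])

# Vectorized arithmetic: the operation is applied to every element at once
doubled = runs * 2

# The underlying data can be exposed as a NumPy array
values = runs.to_numpy()

print(doubled.tolist())
print(type(values))
```

Writing the operation on the whole Series is usually both shorter and faster than an explicit Python loop over its elements.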


Where is it used?
1. Data Analysis & Reporting
2. Machine Learning (preprocessing datasets)
3. Finance, Healthcare, Marketing data analysis
4. CSV/Excel/SQL data handling
 

In [2]:
import pandas as pd

3. Series & DataFrames


In [3]:
# Series 
s = pd.Series([10, 20, 30, 40]) 
print("Series:\n", s) 
# DataFrame from dict 
data = {'Name': ['Alice','Bob','Charlie'], 'Age':[25,30,35]} 
df = pd.DataFrame(data) 
print("\nDataFrame:\n", df)
 

Series:
 0    10
1    20
2    30
3    40
dtype: int64

DataFrame:
       Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
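
To make the relationship between the two structures concrete, here is a small sketch showing that a Series can carry custom labels and that each DataFrame column is itself a Series. The names `ages`, `df_people`, and `age_column` are chosen only for this illustration.

```python
import pandas as pd

# A Series can carry custom labels instead of the default 0..n-1 index
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'])
print(ages['Bob'])  # look up a value by its label

# Each DataFrame column is itself a Series
df_people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
age_column = df_people['Age']
print(type(age_column))

# dtypes lists the data type of every column
print(df_people.dtypes)
```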


4. Reading and Exploring the Data

In [14]:
# Read CSV (replace with your file path)
df = pd.read_csv("most_runs_average_strikerate.csv")
 
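# Explore
print(df.head())        # first 5 rows
print("\n")
print(df.tail())        # last 5 rows
print("\n")
df.info()               # prints dtypes and non-null counts itself; it returns None
print("\n")
print(df.describe())    # summary statistics for the numeric columns
print("\n")
print(df.shape)         # (rows, columns)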

     batsman  total_runs  out  numberofballs    average  strikerate
0    V Kohli        5426  152           4111  35.697368  131.987351
1   SK Raina        5386  160           3916  33.662500  137.538304
2  RG Sharma        4902  161           3742  30.447205  130.999466
3  DA Warner        4717  114           3292  41.377193  143.286756
4   S Dhawan        4601  137           3665  33.583942  125.538881


            batsman  total_runs  out  numberofballs  average  strikerate
511        ND Doshi           0    1             13      0.0         0.0
512         J Denly           0    1              1      0.0         0.0
513         S Ladda           0    2              9      0.0         0.0
514  V Pratap Singh           0    1              1      0.0         0.0
515       S Kaushik           0    1              1      0.0         0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 516 entries, 0 to 515
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
--
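
If the CSV file is not at hand, the same exploration steps can be tried on in-memory text. The sketch below assumes nothing about the real file beyond the column names and the three rows visible in the head() output above; `csv_text` and `sample` are names made up for this example.

```python
import io
import pandas as pd

# Three rows copied from the head() output above, typed inline as CSV text
csv_text = """batsman,total_runs,out,numberofballs
V Kohli,5426,152,4111
SK Raina,5386,160,3916
RG Sharma,4902,161,3742
"""

# pd.read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # inferred type of each column
print(sample.isna().sum())  # missing values per column
```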

In [17]:
# Select column
print(df['batsman'])
print("\n")

# Select multiple columns
print(df[['batsman','total_runs']])
print("\n")

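# Select a single row
print(df.iloc[0])       # first row by integer position
print(df.loc[0])        # row with index label 0 (the same row here, since the index is 0..515)
print("\n")

# Select a subset of rows and columns
# (loc slices include the end label, so this returns rows 0 and 1)
print(df.loc[0:1, ['batsman','out']])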

0             V Kohli
1            SK Raina
2           RG Sharma
3           DA Warner
4            S Dhawan
            ...      
511          ND Doshi
512           J Denly
513           S Ladda
514    V Pratap Singh
515         S Kaushik
Name: batsman, Length: 516, dtype: object


            batsman  total_runs
0           V Kohli        5426
1          SK Raina        5386
2         RG Sharma        4902
3         DA Warner        4717
4          S Dhawan        4601
..              ...         ...
511        ND Doshi           0
512         J Denly           0
513         S Ladda           0
514  V Pratap Singh           0
515       S Kaushik           0

[516 rows x 2 columns]


batsman             V Kohli
total_runs             5426
out                     152
numberofballs          4111
average           35.697368
strikerate       131.987351
Name: 0, dtype: object
batsman             V Kohli
total_runs             5426
out                     152
numberofballs          4111
a
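
With the default 0..n-1 index, iloc and loc return the same row, which hides how they differ. The sketch below uses a small frame with a hypothetical index of 10, 20, 30 (not part of the dataset) to show position-based versus label-based selection.

```python
import pandas as pd

# Small frame whose index labels differ from the row positions (index is hypothetical)
df_demo = pd.DataFrame(
    {'batsman': ['V Kohli', 'SK Raina', 'RG Sharma'], 'out': [152, 160, 161]},
    index=[10, 20, 30],
)

first_by_position = df_demo.iloc[0]   # row at position 0
first_by_label = df_demo.loc[10]      # row whose label is 10

# Slices: iloc excludes the end position, loc includes the end label
by_position = df_demo.iloc[0:1]       # one row
by_label = df_demo.loc[10:20]         # two rows

print(len(by_position), len(by_label))
```

Here `df_demo.loc[0]` would raise a KeyError, because no row carries the label 0.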

In [18]:
# Filter rows: keep batsmen dismissed more than 28 times
print(df[df['out'] > 28])
 
# Sort by number of dismissals ('out'), highest first
print(df.sort_values('out', ascending=False))

       batsman  total_runs  out  numberofballs    average  strikerate
0      V Kohli        5426  152           4111  35.697368  131.987351
1     SK Raina        5386  160           3916  33.662500  137.538304
2    RG Sharma        4902  161           3742  30.447205  130.999466
3    DA Warner        4717  114           3292  41.377193  143.286756
4     S Dhawan        4601  137           3665  33.583942  125.538881
..         ...         ...  ...            ...        ...         ...
118  LR Shukla         405   29            346  13.965517  117.052023
123   R Ashwin         376   33            336  11.393939  111.904762
126   A Mishra         366   29            385  12.620690   95.064935
128   R Bhatia         342   29            284  11.793103  120.422535
129    P Kumar         340   34            314  10.000000  108.280255

[97 rows x 6 columns]
              batsman  total_runs  out  numberofballs    average  strikerate
2           RG Sharma        4902  161           3742  30.44
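
Filters can also be combined before sorting. The following sketch rebuilds the first five rows shown earlier as a small frame (so it runs without the CSV) and applies two conditions at once; the thresholds 130 and 5400 are arbitrary values picked for illustration.

```python
import pandas as pd

# First five rows of the dataset, retyped so the example is self-contained
df_demo = pd.DataFrame({
    'batsman': ['V Kohli', 'SK Raina', 'RG Sharma', 'DA Warner', 'S Dhawan'],
    'total_runs': [5426, 5386, 4902, 4717, 4601],
    'out': [152, 160, 161, 114, 137],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (df_demo['out'] > 130) & (df_demo['total_runs'] < 5400)
filtered = df_demo[mask]

# Sort the filtered rows by dismissals, fewest first
result = filtered.sort_values('out', ascending=True)
print(result)
```

The Python keywords `and` and `or` do not work element-wise on Series, which is why `&` and `|` are used.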

In [19]:
# Add new column
df['total_runs_add5'] = df['total_runs'] + 5
print(df)
print("\n")
 
# Drop column (returns a new DataFrame; the original is unchanged until reassigned)
df = df.drop(columns='total_runs_add5')
print(df)
print("\n")

 
# 8. GroupBy & Aggregation
data = {'Name':['Alice','Bob','Charlie','Alice','Bob'],
        'Score':[85,90,95,80,70]}
df = pd.DataFrame(data)
 
# Group by Name and calculate mean score
grouped = df.groupby('Name').mean()
print(grouped)
 

            batsman  total_runs  out  numberofballs    average  strikerate  \
0           V Kohli        5426  152           4111  35.697368  131.987351   
1          SK Raina        5386  160           3916  33.662500  137.538304   
2         RG Sharma        4902  161           3742  30.447205  130.999466   
3         DA Warner        4717  114           3292  41.377193  143.286756   
4          S Dhawan        4601  137           3665  33.583942  125.538881   
..              ...         ...  ...            ...        ...         ...   
511        ND Doshi           0    1             13   0.000000    0.000000   
512         J Denly           0    1              1   0.000000    0.000000   
513         S Ladda           0    2              9   0.000000    0.000000   
514  V Pratap Singh           0    1              1   0.000000    0.000000   
515       S Kaushik           0    1              1   0.000000    0.000000   

     total_runs_add5  
0               5431  
1               5
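
The group-by step in the cell above computes a single statistic. As a possible extension, the sketch below uses the same Name/Score data with named aggregation to get several statistics per group in one call; the output column names `mean_score`, `max_score`, and `attempts` are invented for this example.

```python
import pandas as pd

scores = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Score': [85, 90, 95, 80, 70],
})

# Named aggregation: one output column per (source column, function) pair
summary = scores.groupby('Name').agg(
    mean_score=('Score', 'mean'),
    max_score=('Score', 'max'),
    attempts=('Score', 'count'),
)
print(summary)
```

The result is indexed by Name, so a single value can be read with, for example, `summary.loc['Alice', 'mean_score']`.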

9. Handling Missing Data

In [21]:
data = {'Name':['Alice','Bob','Charlie','David'],
        'Age':[25, None, 35, 40]}
df = pd.DataFrame(data)
 
# Fill the missing Age with the mean of the non-missing ages
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
 
# Alternative: drop rows with missing values instead of filling them
# df = df.dropna()

      Name        Age
0    Alice  25.000000
1      Bob  33.333333
2  Charlie  35.000000
3    David  40.000000
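
Before filling or dropping, it often helps to check how much data is actually missing. The sketch below reuses the same Name/Age data; using the median as the fill value is an alternative chosen for illustration, not something the notebook prescribes.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 35, 40],
})

# Count missing values per column before deciding how to handle them
missing_counts = people.isna().sum()
print(missing_counts)

# Option 1: drop rows where Age is missing (returns a new DataFrame)
dropped = people.dropna(subset=['Age'])

# Option 2: fill with the median, which is less sensitive to outliers than the mean
filled = people['Age'].fillna(people['Age'].median())

print(len(dropped), filled.tolist())
```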
