## Pandas - Working with a CSV File

In [1]:
import pandas as pd
import numpy as np

### Problem statement
- Use `isin()`
- Find the most destructive death-over batsman in the history of the IPL
- Strike rate = (number of runs / number of balls) * 100
- Minimum of 200 balls faced in overs 16-20

In this Pandas tutorial, we use the `isin()` function to find the most destructive death-over batsman in the history of the IPL. We calculate each batsman's strike rate (runs scored per 100 balls faced) in overs 16 to 20 and keep only the players who faced a minimum of 200 balls in those overs. Comparing these strike rates identifies the most impactful batsman during the crucial final overs of an IPL match.
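As a quick check of the strike-rate formula, the sketch below plugs in the death-over totals for AB de Villiers that appear in the outputs later in this notebook (1203 runs off 570 balls). The helper name `strike_rate` is hypothetical and exists only for this illustration.

```python
def strike_rate(runs: int, balls: int) -> float:
    """Runs scored per 100 balls faced."""
    return runs / balls * 100

# Totals for AB de Villiers in overs 16-20, taken from the notebook outputs
print(round(strike_rate(1203, 570), 2))  # 211.05
```

Dividing by 100 instead of multiplying would give about 0.02, which is why the multiplication matters here.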

In [2]:
delivery = pd.read_csv('deliveries.csv')

In [3]:
delivery.sample(5)

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
68330,289,1,Chennai Super Kings,Delhi Daredevils,18,4,S Badrinath,MS Dhoni,VR Aaron,0,...,0,0,0,0,1,0,1,,,
130380,550,2,Royal Challengers Bangalore,Kolkata Knight Riders,2,4,V Kohli,CH Gayle,UT Yadav,0,...,0,0,0,0,0,0,0,,,
12886,55,2,Delhi Daredevils,Royal Challengers Bangalore,2,5,KK Nair,SS Iyer,YS Chahal,0,...,0,0,0,0,0,0,0,,,
146161,618,1,Sunrisers Hyderabad,Delhi Daredevils,9,6,KS Williamson,S Dhawan,J Yadav,0,...,0,0,0,0,1,0,1,,,
96690,408,2,Delhi Daredevils,Mumbai Indians,11,3,V Sehwag,DPMD Jayawardene,SL Malinga,0,...,0,0,0,0,1,0,1,,,


In [16]:
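# Keep only deliveries from overs 16-20 (the death overs)

# Boolean Series: True where the over number is greater than 15
delivery['over'] > 15

# Passing that Series back into delivery keeps only the matching rows
# delivery[delivery['over'] > 15]

# The same step with the mask stored in a variable
mask_over = delivery['over'] > 15
delivery2 = delivery[mask_over]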
delivery2.sample(2)

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
108983,460,1,Chennai Super Kings,Kings XI Punjab,16,1,SK Raina,DR Smith,MG Johnson,0,...,0,0,0,0,1,0,1,,,
148627,629,1,Kings XI Punjab,Rising Pune Supergiants,19,2,AR Patel,F Behardien,AB Dinda,0,...,0,0,0,0,1,0,1,,,
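The masking step can be tried without the CSV file. This is a minimal sketch on a hypothetical toy DataFrame (the rows are invented, only the column names match `deliveries.csv`); it also shows `Series.between` as an alternative way to express "overs 16 to 20".

```python
import pandas as pd

# Hypothetical toy data standing in for deliveries.csv
toy = pd.DataFrame({
    'over': [3, 16, 15, 20, 18],
    'batsman': ['A', 'B', 'A', 'C', 'B'],
    'batsman_runs': [1, 4, 0, 6, 2],
})

mask = toy['over'] > 15      # boolean Series, one value per row
death_overs = toy[mask]      # keeps only the rows where the mask is True

# between() is inclusive on both ends by default
same = toy[toy['over'].between(16, 20)]

print(mask.tolist())         # [False, True, False, True, True]
print(len(death_overs))      # 3
```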


In [27]:
# delivery2 already holds only overs 16-20, so filtering it again leaves it unchanged
# short method
# delivery2[delivery2['over']>15]

# with mask
over_greater_than_15 = delivery2['over']>15
delivery2 = delivery2[over_greater_than_15]
delivery2.sample(2)

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
143923,608,2,Kings XI Punjab,Kolkata Knight Riders,18,6,AR Patel,Gurkeerat Singh,AD Russell,0,...,0,0,0,0,6,0,6,,,
123314,520,1,Rajasthan Royals,Kings XI Punjab,19,4,JP Faulkner,CH Morris,MG Johnson,0,...,0,0,0,0,4,0,4,,,


In [29]:
# Deliveries faced per batsman across the whole dataset
delivery.groupby('batsman')['batsman_runs'].count()

# Deliveries faced per batsman in overs 16-20 only
delivery2.groupby('batsman')['batsman_runs'].count()

batsman
A Ashish Reddy    148
A Chandila          7
A Chopra            2
A Choudhary        20
A Flintoff         18
                 ... 
YS Chahal          27
YV Takawale        13
Yashpal Singh      13
Yuvraj Singh      516
Z Khan            109
Name: batsman_runs, Length: 416, dtype: int64

In [38]:
# Balls faced per batsman in overs 16-20, stored in a variable
all_batsman = delivery2.groupby('batsman')['batsman_runs'].count()

x = all_batsman>200

# Apply the mask: batsmen who faced more than 200 balls
all_batsman[x]

batsman
A Mishra             225
AB de Villiers       570
AD Mathews           289
AM Rahane            268
AR Patel             229
AT Rayudu            425
BJ Hodge             385
DA Miller            360
DA Warner            228
DJ Bravo             409
DJ Hussey            234
DPMD Jayawardene     246
Harbhajan Singh      418
IK Pathan            465
JA Morkel            425
JH Kallis            231
JP Duminy            518
JP Faulkner          294
KA Pollard           838
KD Karthik           463
KM Jadhav            338
LRPL Taylor          204
MK Pandey            224
MK Tiwary            423
MS Dhoni            1224
NV Ojha              304
P Kumar              268
PP Chawla            311
R Vinay Kumar        235
RA Jadeja            576
RG Sharma            748
RV Uthappa           275
S Badrinath          283
S Dhawan             243
SK Raina             458
SPD Smith            316
SS Tiwary            300
STR Binny            218
V Kohli              546
WP Saha          

In [39]:
all_batsman[x].shape

(43,)
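The problem statement asks for a minimum of 200 balls, while the cell above uses a strict `>`. The two only differ for a batsman sitting on exactly 200 balls, and the outputs here do not show whether such a batsman exists. A small sketch with hypothetical counts illustrates the difference:

```python
import pandas as pd

# Hypothetical ball counts, not taken from the dataset
balls = pd.Series({'A': 150, 'B': 200, 'C': 320})

strictly_more = balls[balls > 200].index.tolist()
at_least = balls[balls >= 200].index.tolist()

print(strictly_more)  # ['C']
print(at_least)       # ['B', 'C']
```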

In [43]:
# Take the index of the Series, because we only need the batsman names

all_batsman[x].index

# Convert the index to a list
all_batsman[x].index.tolist()
# Store the list in batsman_list
batsman_list = all_batsman[x].index.tolist()
batsman_list

['A Mishra',
 'AB de Villiers',
 'AD Mathews',
 'AM Rahane',
 'AR Patel',
 'AT Rayudu',
 'BJ Hodge',
 'DA Miller',
 'DA Warner',
 'DJ Bravo',
 'DJ Hussey',
 'DPMD Jayawardene',
 'Harbhajan Singh',
 'IK Pathan',
 'JA Morkel',
 'JH Kallis',
 'JP Duminy',
 'JP Faulkner',
 'KA Pollard',
 'KD Karthik',
 'KM Jadhav',
 'LRPL Taylor',
 'MK Pandey',
 'MK Tiwary',
 'MS Dhoni',
 'NV Ojha',
 'P Kumar',
 'PP Chawla',
 'R Vinay Kumar',
 'RA Jadeja',
 'RG Sharma',
 'RV Uthappa',
 'S Badrinath',
 'S Dhawan',
 'SK Raina',
 'SPD Smith',
 'SS Tiwary',
 'STR Binny',
 'V Kohli',
 'WP Saha',
 'Y Venugopal Rao',
 'YK Pathan',
 'Yuvraj Singh']

In [50]:
# Strike rate = (number of runs / number of balls) * 100
# Runs scored by these 43 batsmen
# Balls faced by these 43 batsmen

# isin() is used here; it is described in detail after this problem.
delivery['batsman'].isin(batsman_list)

# Use the boolean result as a mask to keep only the rows for these batsmen
delivery[delivery['batsman'].isin(batsman_list)].sample(2)

# delivery[delivery['batsman'].isin(batsman_list)].shape  - (66006, 21)

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
18742,80,2,Royal Challengers Bangalore,Deccan Chargers,1,1,JH Kallis,W Jaffer,WPUJC Vaas,0,...,0,0,0,0,0,0,0,,,
6104,26,1,Kings XI Punjab,Gujarat Lions,16,6,AR Patel,MP Stoinis,AJ Tye,0,...,0,0,0,0,4,0,4,,,


In [64]:
# Restrict delivery2 (overs 16-20) to the batsmen in batsman_list
final = delivery2[delivery2['batsman'].isin(batsman_list)]

# Runs per batsman: sum of batsman_runs
runs = final.groupby('batsman')['batsman_runs'].sum()
runs.head(2)

batsman
A Mishra           227
AB de Villiers    1203
Name: batsman_runs, dtype: int64

In [65]:
# Balls per batsman: count() of the rows in each group
balls = final.groupby('batsman')['batsman_runs'].count()
balls.head(2)

batsman
A Mishra          225
AB de Villiers    570
Name: batsman_runs, dtype: int64
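The two `groupby` calls above can also be written as a single pass with named aggregation. This is a sketch on hypothetical toy data, not the notebook's own approach; the result column names `runs`, `balls`, and `sr` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy deliveries
toy = pd.DataFrame({
    'batsman': ['A', 'B', 'A', 'B', 'A'],
    'batsman_runs': [4, 1, 6, 0, 2],
})

# sum gives the runs scored, count gives the number of rows (balls) per batsman
stats = toy.groupby('batsman')['batsman_runs'].agg(runs='sum', balls='count')
stats['sr'] = stats['runs'] / stats['balls'] * 100

print(stats['sr'].round(2).tolist())  # [400.0, 50.0]
```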

In [60]:
sr = (runs/balls)*100
sr

batsman
A Mishra            100.888889
AB de Villiers      211.052632
AD Mathews          147.058824
AM Rahane           152.985075
AR Patel            142.794760
AT Rayudu           165.411765
BJ Hodge            157.402597
DA Miller           186.666667
DA Warner           189.473684
DJ Bravo            167.726161
DJ Hussey           175.213675
DPMD Jayawardene    152.032520
Harbhajan Singh     147.607656
IK Pathan           142.580645
JA Morkel           149.882353
JH Kallis           170.562771
JP Duminy           167.760618
JP Faulkner         149.319728
KA Pollard          161.336516
KD Karthik          152.051836
KM Jadhav           144.378698
LRPL Taylor         152.941176
MK Pandey           151.785714
MK Tiwary           140.189125
MS Dhoni            169.607843
NV Ojha             134.868421
P Kumar             109.701493
PP Chawla           120.257235
R Vinay Kumar       108.936170
RA Jadeja           130.729167
RG Sharma           175.668449
RV Uthappa          173.454545
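
To answer the original question, the strike rates still need to be ranked. The sketch below does that for a handful of values copied from the output above. Because the printed list is truncated, this covers only a subset of the 43 qualifying batsmen and does not by itself settle the overall winner; on the full `sr` Series the same `sort_values` call would apply.

```python
import pandas as pd

# A few strike rates copied from the output above (subset only)
sr_subset = pd.Series({
    'AB de Villiers': 211.052632,
    'DA Warner': 189.473684,
    'DA Miller': 186.666667,
    'RG Sharma': 175.668449,
    'MS Dhoni': 169.607843,
})

ranked = sr_subset.sort_values(ascending=False)  # highest strike rate first
print(ranked.index[0])       # AB de Villiers
print(sr_subset.idxmax())    # AB de Villiers
```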


### `isin()` function in pandas

The `isin()` function in pandas is used to check whether elements of a Series or DataFrame belong to a specific list of values. It returns a boolean mask that indicates whether each element is present in the specified list of values or not.

**Syntax:** 
For a Series:
```python
result = series.isin(values)
```

For a DataFrame:
```python
result = dataframe.isin(values)
```

**Parameters:**
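- `values`: The values to test membership against. For a Series, pass a set or list-like object (a list, set, or another Series). For a DataFrame, pass an iterable, a Series, another DataFrame, or a dictionary; with a dictionary, the keys are column names and each column is checked against its own values.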

**Returns:**
- `result`: It is a boolean Series or DataFrame that indicates whether each element is present in the 'values' or not. The elements that are present are marked as `True`, and the elements that are not present are marked as `False`.
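
For the DataFrame form, a short sketch with a dictionary argument may help: each key names a column, and that column is checked only against its own list. The data here is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 22],
                   'City': ['New York', 'London', 'Paris']})

# Each column is compared against the values listed under its own key
result = df.isin({'Age': [25, 30], 'City': ['Paris']})

print(result['Age'].tolist())   # [True, True, False]
print(result['City'].tolist())  # [False, False, True]
```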

**Example:**
```python
import pandas as pd

# Sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
        'Age': [25, 30, 22, 28, 35],
        'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']}

df = pd.DataFrame(data)

# Check if 'Age' column contains values 25, 28, and 30
result = df['Age'].isin([25, 28, 30])
print(result)
```

**Output:**
```
0     True
1     True
2    False
3     True
4    False
Name: Age, dtype: bool
```

In this example, the `isin()` function checks if the values 25, 28, and 30 are present in the 'Age' column of the DataFrame. The resulting boolean Series indicates that the first, second, and fourth elements are present (marked as `True`), while the third and fifth elements are not (marked as `False`).
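
The boolean Series is usually passed straight back into the DataFrame as a mask, and `~` inverts it to select the rows that are not in the list. A minimal sketch reusing the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
                   'Age': [25, 30, 22, 28, 35]})

mask = df['Age'].isin([25, 28, 30])

in_list = df[mask]['Name'].tolist()       # rows whose Age is in the list
not_in_list = df[~mask]['Name'].tolist()  # ~ flips True and False

print(in_list)      # ['John', 'Alice', 'Jane']
print(not_in_list)  # ['Bob', 'Mike']
```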
