## Pandas Crash Course for ML

### 1. File Reading and Writing
- **Reading and Writing CSV Files**: Learn how to load data from and save data to CSV files.
- **Exploring Other File Formats**: Understand how to handle different file formats like Excel and JSON.


In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('data/data.csv')

In [7]:
data = pd.read_json('data/data.json')
data


Unnamed: 0,name,age
0,Alice,25


In [8]:
import json

# Use a context manager so the file handle is closed after reading
with open('data/data.json') as f:
    data = json.load(f)
data

[{'name': 'Alice', 'age': 25}]

In [9]:
df = pd.read_csv('data/example.txt')
df

Unnamed: 0,Hello,world!
0,This is an appended line.,


In [10]:
df = pd.read_csv('data/example.txt')
df

Unnamed: 0,Hello world!
0,This is an appended line.


In [11]:
file_path = 'https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv'
df = pd.read_csv(file_path)

In [14]:
df.to_csv('data/df_titanic.csv', index=False)

In [15]:
df.to_csv('data/df_titanic_custom_sep.tsv', index=False, sep='\t')

In [16]:
df.to_csv('data/df_titanic_custom_sep.txt', index=False, sep='|')

In [19]:
df = pd.read_csv('data/df_titanic_custom_sep.txt', sep='|')
# df

In [20]:
df.to_excel('data/titanic.xlsx', index=False)
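
The custom-separator round trip can be checked without touching the disk. This is a minimal sketch, assuming a small made-up frame in place of the Titanic data and an in-memory `io.StringIO` buffer in place of a file path (both are illustrative, not part of the notebook):

```python
import io

import pandas as pd

# Toy stand-in for the Titanic frame (values are illustrative only).
toy = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 31]})

# Write with a pipe separator and no index column.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, sep="|")

# Read it back; the same separator must be passed to read_csv.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="|")

print(buffer.getvalue())
print(restored.equals(toy))
```

If the separator passed to `read_csv` does not match the one used when writing, each line is read as a single column, so it is worth checking `restored.columns` after loading.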

### 2. Columns
- **Selecting Columns**: Learn how to access specific columns.
- **Adding and Renaming Columns**: Create new columns and rename existing ones for clarity.


In [23]:
df.columns, len(df.columns)

(Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
       dtype='object'),
 12)

In [27]:
df.columns = [x.lower() for x in df.columns]

In [36]:
type(df['name']), type(df[['name']])


(pandas.core.series.Series, pandas.core.frame.DataFrame)

In [39]:
df['family_size'] = df['sibsp'] + df['parch']

In [41]:
df['fare_per_person'] = df['fare']/(df['family_size']+1)

In [44]:
df['is_alone'] = df['family_size'].apply(lambda x: 1 if x==0 else 0)

In [51]:
df['is_alone_bool'] = (~df['family_size'].astype('bool')).astype('int')

In [57]:
df['is_alone_condition'] = (df['family_size']==0).astype('int')
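
The three `is_alone` variants above should produce the same column. A small sketch that checks this, assuming a hypothetical `family_size` series rather than the real data:

```python
import pandas as pd

# Hypothetical family sizes; 0 means the passenger travelled alone.
family_size = pd.Series([1, 0, 4, 0], name="family_size")

via_apply = family_size.apply(lambda x: 1 if x == 0 else 0)   # row by row
via_bool = (~family_size.astype("bool")).astype("int")        # invert truthiness
via_condition = (family_size == 0).astype("int")              # vectorised comparison

print(via_apply.tolist())
print(via_bool.tolist())
print(via_condition.tolist())
```

The comparison form is usually the easiest to read, and it avoids calling a Python function once per row.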

In [60]:
df = df.rename(columns={'sibsp': 'sibling/spouse aboard'})

In [62]:
df = df.rename(columns={'pclass': 'passenger_class', 'sex': 'gender'})

In [66]:
df['is_adult'] = (df['age']>=18).astype(int)

In [73]:
df[['age', 'is_adult', 'sibling/spouse aboard', 'fare']]

numerical_cols = df.select_dtypes(include='number')
numerical_cols

object_cols = df.select_dtypes(exclude='number')
object_cols.columns

Index(['name', 'gender', 'ticket', 'cabin', 'embarked'], dtype='object')

##### Head, Tail, Sample

In [80]:
df.head(1)

Unnamed: 0,passengerid,survived,passenger_class,name,gender,age,sibling/spouse aboard,parch,ticket,fare,cabin,embarked,family_size,fare_per_person,is_alone,is_alone_bool,is_alone_condition,is_adult
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,3.625,0,0,0,1


In [83]:
df.tail(1)

Unnamed: 0,passengerid,survived,passenger_class,name,gender,age,sibling/spouse aboard,parch,ticket,fare,cabin,embarked,family_size,fare_per_person,is_alone,is_alone_bool,is_alone_condition,is_adult
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q,0,7.75,1,1,1,1


In [112]:
df.sample(5)
df.sample(frac=0.01)

df.sample(frac=0.01, random_state=1)

Unnamed: 0,passengerid,survived,passenger_class,name,gender,age,sibling/spouse aboard,parch,ticket,fare,cabin,embarked,family_size,fare_per_person,is_alone,is_alone_bool,is_alone_condition,is_adult
862,863,1,1,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",female,48.0,0,0,17466,25.9292,D17,S,0,25.9292,1,1,1,1
223,224,0,3,"Nenkoff, Mr. Christo",male,,0,0,349234,7.8958,,S,0,7.8958,1,1,1,0
84,85,1,2,"Ilett, Miss. Bertha",female,17.0,0,0,SO/C 14885,10.5,,S,0,10.5,1,1,1,0
680,681,0,3,"Peters, Miss. Katie",female,,0,0,330935,8.1375,,Q,0,8.1375,1,1,1,0
535,536,1,2,"Hart, Miss. Eva Miriam",female,7.0,0,2,F.C.C. 13529,26.25,,S,2,8.75,0,0,0,0
623,624,0,3,"Hansen, Mr. Henry Damsgaard",male,21.0,0,0,350029,7.8542,,S,0,7.8542,1,1,1,1
148,149,0,2,"Navratil, Mr. Michel (""Louis M Hoffman"")",male,36.5,0,2,230080,26.0,F2,S,2,8.666667,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,26.55,0,0,0,1
34,35,0,1,"Meyer, Mr. Edgar Joseph",male,28.0,1,0,PC 17604,82.1708,,C,1,41.0854,0,0,0,1


### 3. DataFrame and Series
- **Introduction to DataFrame and Series**: Understand the basics of DataFrame and Series, the core structures in Pandas.
- **Indexing and Slicing**: Learn how to access specific rows, columns, or subsets of data.


In [113]:
df.columns

Index(['passengerid', 'survived', 'passenger_class', 'name', 'gender', 'age',
       'sibling/spouse aboard', 'parch', 'ticket', 'fare', 'cabin', 'embarked',
       'family_size', 'fare_per_person', 'is_alone', 'is_alone_bool',
       'is_alone_condition', 'is_adult'],
      dtype='object')

In [115]:
type(df['age'])

pandas.core.series.Series

In [128]:
df[5:10]

# .iloc

type(df.iloc[0])

df.iloc[0:2]

df.iloc[10:20]

df.iloc[10:20]['passenger_class']

df.iloc[1,3]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

In [130]:
df.loc[:1]

Unnamed: 0,passengerid,survived,passenger_class,name,gender,age,sibling/spouse aboard,parch,ticket,fare,cabin,embarked,family_size,fare_per_person,is_alone,is_alone_bool,is_alone_condition,is_adult
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,3.625,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,35.64165,0,0,0,1
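
`df.loc[:1]` returns two rows above because label-based slices include the stop label, while `.iloc` slices exclude the stop position. A minimal sketch of the difference, assuming a toy frame with made-up, non-default index labels so that labels and positions do not coincide:

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["Alice", "Bob", "Cara", "Dan"], "age": [25, 31, 47, 19]},
    index=[10, 20, 30, 40],  # non-default labels make the difference visible
)

by_position = toy.iloc[0:2]  # positions 0 and 1; the stop is excluded
by_label = toy.loc[10:30]    # labels 10 through 30; the stop is included

print(len(by_position), len(by_label))
print(toy.iloc[1, 0])        # row position 1, column position 0
print(toy.loc[20, "name"])   # the same cell, addressed by label
```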



### 4. Info, Shape, Duplicated, and Drop
- **Data Overview with `.info()` and `.shape`**: Get a summary of the dataset and its dimensions (`.shape` is an attribute, not a method).
- **Identifying and Dropping Duplicates**: Learn how to find and remove duplicate entries.
- **Dropping Unnecessary Data**: Remove columns or rows that are not needed.


In [132]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   passengerid            891 non-null    int64  
 1   survived               891 non-null    int64  
 2   passenger_class        891 non-null    int64  
 3   name                   891 non-null    object 
 4   gender                 891 non-null    object 
 5   age                    714 non-null    float64
 6   sibling/spouse aboard  891 non-null    int64  
 7   parch                  891 non-null    int64  
 8   ticket                 891 non-null    object 
 9   fare                   891 non-null    float64
 10  cabin                  204 non-null    object 
 11  embarked               889 non-null    object 
 12  family_size            891 non-null    int64  
 13  fare_per_person        891 non-null    float64
 14  is_alone               891 non-null    int64  
 15  is_alo

In [135]:
df.isnull().sum()

df.shape

(891, 18)

In [136]:
df.describe()

Unnamed: 0,passengerid,survived,passenger_class,age,sibling/spouse aboard,parch,fare,family_size,fare_per_person,is_alone,is_alone_bool,is_alone_condition,is_adult
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.904602,19.916375,0.602694,0.602694,0.602694,0.674523
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,1.613459,35.841257,0.489615,0.489615,0.489615,0.468816
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104,0.0,7.25,0.0,0.0,0.0,0.0
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542,0.0,8.3,1.0,1.0,1.0,1.0
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0,1.0,23.666667,1.0,1.0,1.0,1.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292,10.0,512.3292,1.0,1.0,1.0,1.0


In [138]:
df.duplicated().sum()

0

In [139]:
df = df.drop_duplicates()

In [142]:
df.isnull().sum()

df1 = df.dropna()
df.shape, df1.shape

((891, 18), (183, 18))

In [143]:
df1.isnull().sum()

passengerid              0
survived                 0
passenger_class          0
name                     0
gender                   0
age                      0
sibling/spouse aboard    0
parch                    0
ticket                   0
fare                     0
cabin                    0
embarked                 0
family_size              0
fare_per_person          0
is_alone                 0
is_alone_bool            0
is_alone_condition       0
is_adult                 0
dtype: int64

In [144]:
df1 = df.dropna(subset=['embarked'])
df.shape, df1.shape

((891, 18), (889, 18))
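
The `dropna` variants differ in how aggressively they remove rows. A short sketch, assuming a made-up four-row frame (column names borrowed from the Titanic data, values invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "cabin": [np.nan, "C85", np.nan, np.nan],
        "embarked": ["S", "C", np.nan, "Q"],
    }
)

any_missing = toy.dropna()                       # drop rows with any NaN
only_embarked = toy.dropna(subset=["embarked"])  # only consider one column
at_least_two = toy.dropna(thresh=2)              # keep rows with >= 2 non-null values

print(len(any_missing), len(only_embarked), len(at_least_two))
```

On sparse data a bare `dropna()` can discard most of the frame, which is why restricting it with `subset` or `thresh` is often the safer choice.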

### 5. Filtering Data
- **Basic Filtering**: Learn how to filter DataFrame rows based on conditions.
- **Multiple Conditions**: Apply multiple conditions to filter data.


In [161]:
flag = df['age'].isnull()
df[flag]

df1 = df[~flag]

df1[df1['age']>50]

flag = (df1['age']>50) & (df1['age']<60)
df1[flag]

df[(df['gender']=='male')&(df['is_adult']==0)]

Unnamed: 0,passengerid,survived,passenger_class,name,gender,age,sibling/spouse aboard,parch,ticket,fare,cabin,embarked,family_size,fare_per_person,is_alone,is_alone_bool,is_alone_condition,is_adult
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,0,8.458300,1,1,1,0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S,4,4.215000,0,0,0,0
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q,5,4.854167,0,0,0,0
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S,0,13.000000,1,1,1,0
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C,0,7.225000,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
850,851,0,3,"Andersson, Master. Sigvard Harald Elias",male,4.0,4,2,347082,31.2750,,S,6,4.467857,0,0,0,0
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C,0,7.229200,1,1,1,0
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S,0,9.500000,1,1,1,0
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S,2,3.711100,0,0,0,0
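
The last filter above also returns passengers whose age is missing (for example "Moran, Mr. James"), because `NaN >= 18` evaluates to `False` and so `is_adult` is 0 for them. A minimal sketch of the effect and one way to exclude such rows, assuming an invented four-row frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "gender": ["male", "male", "female", "male"],
        "age": [2.0, np.nan, 30.0, 40.0],
    }
)
toy["is_adult"] = (toy["age"] >= 18).astype(int)  # NaN >= 18 is False

# Each condition needs its own parentheses because & binds tighter than ==.
non_adult_males = toy[(toy["gender"] == "male") & (toy["is_adult"] == 0)]

# Requiring a known age removes rows that only look like minors.
known_minor_males = toy[
    (toy["gender"] == "male") & toy["age"].notna() & (toy["age"] < 18)
]

print(len(non_adult_males), len(known_minor_males))
```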


### 6. NaN and Null Values
- **Identifying and Handling Missing Data**: Discover how to find and manage NaN or null values in your dataset.


In [166]:
df.isnull().sum()

# fillna
df1 = df.fillna('filling with something')

In [172]:
df1 = df.copy()
df1['cabin'] = df['cabin'].fillna('filling with something')

In [177]:
# df1 = df.copy()

df1 = df.fillna({'cabin': 'fill something in cabin', 'embarked': 'fill something'})

In [182]:
df1 = df.copy()

df1['cabin'] = df1['cabin'].ffill()
df1['cabin'] = df1['cabin'].bfill()


In [186]:
# axis=1 drops columns; keep only columns with at least 99% non-null values
df1 = df1.dropna(thresh=len(df1)*0.99, axis=1)
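
Calling `ffill()` and then `bfill()` is needed because a forward fill cannot reach values at the very start of a column. A small sketch, assuming a hypothetical cabin series with a leading gap:

```python
import numpy as np
import pandas as pd

cabin = pd.Series([np.nan, "C85", np.nan, np.nan, "E46"])

forward = cabin.ffill()       # copies the last seen value downwards
both = cabin.ffill().bfill()  # bfill covers the leading gap ffill cannot

print(forward.isna().sum(), both.isna().sum())
print(both.tolist())
```

Whether neighbouring rows are a sensible source for a missing value depends on the data; for an unordered table such as the passenger list this is mainly a demonstration of the API.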

### 7. Imputation
- **Filling Missing Data**: Replace NaN values with appropriate values using different strategies.


In [194]:
df1 = df.copy()

# df1['age'] = df1['age'].fillna(df1['age'].mean())
# df1['age'].fillna(df1['age'].mean(), inplace=True)

# df1.fillna({'age': df1['age'].mean()}, inplace=True)

In [198]:
df1['age'].median(), df1['age'].mean()

(28.0, 29.69911764705882)

In [196]:
df1.fillna({'age': df1['age'].median()}, inplace=True)

In [202]:
# mode() returns every value tied for most frequent; [2] picks the third of them
df1['cabin'].mode()[2]

'G6'

In [204]:
df1.fillna({'cabin': df1['cabin'].mode()[2]}, inplace=True)

In [207]:
age1 = df1['age']

df1 = df.copy()

df1['age'] = df1['age'].interpolate()

age2 = df1['age']

In [212]:
age = pd.DataFrame([df['age'].fillna(df['age'].mean()), age2]).T
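
Mean filling and interpolation can give quite different values for the same gaps. A minimal sketch that puts them side by side, assuming a made-up age series (the frame is built from a dict here, which is an alternative to transposing a list of Series):

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, np.nan, 40.0, np.nan, np.nan, 70.0])

mean_filled = age.fillna(age.mean())  # every gap gets the same constant
interpolated = age.interpolate()      # linear between neighbours by default

comparison = pd.DataFrame(
    {"mean_filled": mean_filled, "interpolated": interpolated}
)
print(comparison)
```

Interpolation assumes that row order carries meaning, so for unordered records a mean or median fill is usually easier to justify.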

### 8. Lambda Function
- **Applying Functions with `apply()`**: Use lambda functions for custom operations across rows or columns.


In [216]:
df['age'] = df['age'].interpolate()

In [224]:
df['is_teen'] = df['age'].apply(lambda x: 1 if 14 < x < 18 else 0)
df['is_child'] = df['age'].apply(lambda x: 1 if x<=14 else 0)

df['is_senior'] = df['age'].apply(lambda x: 1 if x>=60 else 0)

In [229]:
df['is_teen'].value_counts()
df['is_child'].value_counts()
df['is_senior'].value_counts()

df[df['is_senior']==1][['age', 'is_senior']].head()

Unnamed: 0,age,is_senior
33,66.0,1
54,65.0,1
95,65.0,1
96,71.0,1
116,70.5,1


In [230]:
df['fare_per_family_person'] = df.apply(lambda x: x['fare']/(x['family_size']+1), axis=1)

In [239]:
df['name'].str.len()
df['name_length'] = df['name'].apply(lambda x: len(x))

In [247]:
df['title'] = df['name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

In [249]:
df.head(3)

Unnamed: 0,passengerid,survived,passenger_class,name,gender,age,sibling/spouse aboard,parch,ticket,fare,...,is_alone,is_alone_bool,is_alone_condition,is_adult,is_teen,is_child,is_senior,fare_per_family_person,name_length,title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,0,0,0,1,0,0,0,3.625,23,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,0,0,0,1,0,0,0,35.64165,51,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,...,1,1,1,1,0,0,0,7.925,22,Miss
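
Many `apply` calls on string columns have a vectorised `.str` equivalent. A sketch of the title extraction done both ways, using two names from the output above; the regular expression is an assumption about the "Surname, Title. Given names" layout and is not taken from the notebook:

```python
import pandas as pd

names = pd.Series(
    ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], name="name"
)

# Row-by-row version, as in the cell above.
title_apply = names.apply(lambda x: x.split(",")[1].split(".")[0].strip())

# Vectorised version: capture the text between the comma and the first period.
title_str = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(title_apply.tolist())
print(title_str.tolist())
```

The `.str` version returns `NaN` for names that do not match the pattern, whereas the `split` version raises an `IndexError` when a name has no comma.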


### 9. Grouping and Aggregation
- **Grouping Data**: Learn how to group data by one or more columns.
- **Aggregating Data**: Perform operations like mean, sum, and count on grouped data.


In [252]:
# Typically: group by categorical columns, aggregate numerical columns
df.columns

Index(['passengerid', 'survived', 'passenger_class', 'name', 'gender', 'age',
       'sibling/spouse aboard', 'parch', 'ticket', 'fare', 'cabin', 'embarked',
       'family_size', 'fare_per_person', 'is_alone', 'is_alone_bool',
       'is_alone_condition', 'is_adult', 'is_teen', 'is_child', 'is_senior',
       'fare_per_family_person', 'name_length', 'title'],
      dtype='object')

In [259]:
df1 = df.groupby('passenger_class')
df1.groups.keys()

dict_keys([1, 2, 3])

In [261]:
df1['age'].mean()

passenger_class
1    37.033395
2    29.713605
3    26.516096
Name: age, dtype: float64

In [262]:
df1 = df.groupby('gender')
df1['age'].mean()

gender
female    28.570860
male      30.354714
Name: age, dtype: float64

In [265]:
df1 = df.groupby(['gender', 'passenger_class'])

df1.groups

df1['age'].mean()

gender  passenger_class
female  1                  34.084681
        2                  28.697018
        3                  24.904977
male    1                  39.305355
        2                  30.428981
        3                  27.184688
Name: age, dtype: float64

In [267]:
df1[['age', 'fare']].mean()

gender,passenger_class,age,fare
female,1,34.084681,106.125798
female,2,28.697018,21.970121
female,3,24.904977,16.11881
male,1,39.305355,67.226127
male,2,30.428981,19.741782
male,3,27.184688,12.661633


In [269]:
df1.agg({'age': ['mean', 'median'], 'fare': ['sum', 'mean', 'median']})

,,age,age,fare,fare,fare
gender,passenger_class,mean,median,sum,mean,median
female,1,34.084681,34.0,9975.825,106.125798,82.66455
female,2,28.697018,28.0,1669.7292,21.970121,22.0
female,3,24.904977,25.0,2321.1086,16.11881,12.475
male,1,39.305355,37.666667,8201.5875,67.226127,41.2625
male,2,30.428981,29.75,2132.1125,19.741782,13.0
male,3,27.184688,26.5,4393.5865,12.661633,7.925


In [271]:
df['embarked'].value_counts()

embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [274]:
df1 = df.groupby('embarked')
df1['passengerid'].count()

embarked
C    168
Q     77
S    644
Name: passengerid, dtype: int64
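
When several aggregations are needed at once, named aggregation gives flat, readable column names instead of a two-level header. A minimal sketch, assuming an invented four-row frame; the output column names (`mean_age`, `total_fare`, `passengers`) are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "gender": ["female", "female", "male", "male"],
        "age": [30.0, 40.0, 20.0, 50.0],
        "fare": [10.0, 30.0, 5.0, 15.0],
    }
)

# Each keyword is an output column: (input column, aggregation).
summary = toy.groupby("gender").agg(
    mean_age=("age", "mean"),
    total_fare=("fare", "sum"),
    passengers=("fare", "count"),
)
print(summary)
```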

### 10. Merging and Joining DataFrames
- **Combining DataFrames**: Learn how to merge or join DataFrames.
- **Different Types of Joins**: Understand inner, outer, left, and right joins.


In [283]:
df1 = df[['passengerid', 'name', 'age']].head(5)
df2 = df[['passengerid', 'fare', 'embarked']].head(10)

In [285]:
df1


Unnamed: 0,passengerid,name,age
0,1,"Braund, Mr. Owen Harris",22.0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,3,"Heikkinen, Miss. Laina",26.0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,5,"Allen, Mr. William Henry",35.0


In [287]:
# df2

In [284]:
df3 = pd.merge(df1, df2, on='passengerid', how='inner')
df3

Unnamed: 0,passengerid,name,age,fare,embarked
0,1,"Braund, Mr. Owen Harris",22.0,7.25,S
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833,C
2,3,"Heikkinen, Miss. Laina",26.0,7.925,S
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1,S
4,5,"Allen, Mr. William Henry",35.0,8.05,S


In [288]:
df3 = pd.merge(df1, df2, on='passengerid', how='left')
df3

Unnamed: 0,passengerid,name,age,fare,embarked
0,1,"Braund, Mr. Owen Harris",22.0,7.25,S
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833,C
2,3,"Heikkinen, Miss. Laina",26.0,7.925,S
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1,S
4,5,"Allen, Mr. William Henry",35.0,8.05,S


In [289]:
df3 = pd.merge(df1, df2, on='passengerid', how='right')
df3

Unnamed: 0,passengerid,name,age,fare,embarked
0,1,"Braund, Mr. Owen Harris",22.0,7.25,S
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833,C
2,3,"Heikkinen, Miss. Laina",26.0,7.925,S
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1,S
4,5,"Allen, Mr. William Henry",35.0,8.05,S
5,6,,,8.4583,Q
6,7,,,51.8625,S
7,8,,,21.075,S
8,9,,,11.1333,S
9,10,,,30.0708,C


In [290]:
df3 = pd.merge(df1, df2, on='passengerid', how='outer')
df3

Unnamed: 0,passengerid,name,age,fare,embarked
0,1,"Braund, Mr. Owen Harris",22.0,7.25,S
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833,C
2,3,"Heikkinen, Miss. Laina",26.0,7.925,S
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1,S
4,5,"Allen, Mr. William Henry",35.0,8.05,S
5,6,,,8.4583,Q
6,7,,,51.8625,S
7,8,,,21.075,S
8,9,,,11.1333,S
9,10,,,30.0708,C


In [293]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,passengerid,name,age,passengerid.1,fare,embarked
0,1.0,"Braund, Mr. Owen Harris",22.0,1,7.25,S
1,2.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,2,71.2833,C
2,3.0,"Heikkinen, Miss. Laina",26.0,3,7.925,S
3,4.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,4,53.1,S
4,5.0,"Allen, Mr. William Henry",35.0,5,8.05,S
5,,,,6,8.4583,Q
6,,,,7,51.8625,S
7,,,,8,21.075,S
8,,,,9,11.1333,S
9,,,,10,30.0708,C


In [None]:
# DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2], axis=0) instead
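
To see which side each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A small sketch, assuming two made-up frames that only partly overlap on `passengerid`:

```python
import pandas as pd

left = pd.DataFrame({"passengerid": [1, 2, 3], "name": ["A", "B", "C"]})
right = pd.DataFrame({"passengerid": [2, 3, 4], "fare": [71.3, 7.9, 53.1]})

# Outer join keeps every key; _merge records left_only / both / right_only.
merged = pd.merge(left, right, on="passengerid", how="outer", indicator=True)
print(merged[["passengerid", "_merge"]])

# Inner join keeps only the keys present on both sides.
inner = pd.merge(left, right, on="passengerid", how="inner")
print(len(inner), len(merged))
```

Filtering on `_merge == 'left_only'` is a quick way to find keys that failed to match.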

### 11. Sorting Data
- **Sorting Rows**: Learn how to sort data by one or more columns.
- **Sorting with Different Orders**: Understand how to sort in ascending or descending order.


In [304]:
df.sort_values(by='age')

df.sort_values(by='age', ascending=False)

df.sort_values(by=['age', 'fare'])

df1 = df.sort_values(by=['passenger_class', 'fare'])

df1.sort_index().head(1)

Unnamed: 0,passengerid,survived,passenger_class,name,gender,age,sibling/spouse aboard,parch,ticket,fare,...,is_alone,is_alone_bool,is_alone_condition,is_adult,is_teen,is_child,is_senior,fare_per_family_person,name_length,title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,0,0,0,1,0,0,0,3.625,23,Mr


In [309]:
df.sort_values(by='name_length')

df1 = df.sort_values(by=['passenger_class', 'fare'], ascending=[True, False])
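
Sorting moves the rows but keeps their original index labels, which is why `sort_index()` can restore the original order. A minimal sketch, assuming a toy frame with invented values:

```python
import pandas as pd

toy = pd.DataFrame(
    {"passenger_class": [3, 1, 1, 3], "fare": [7.25, 71.28, 53.10, 8.05]}
)

# Class ascending, then fare descending within each class.
ordered = toy.sort_values(by=["passenger_class", "fare"], ascending=[True, False])
print(ordered.index.tolist())  # original labels travel with the rows

restored = ordered.sort_index()              # back to the original row order
renumbered = ordered.reset_index(drop=True)  # or discard the old labels
print(renumbered.index.tolist())
```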


### 12. Handling Categorical Data
- **Working with Categorical Data**: Learn how to handle and manipulate categorical data in Pandas.
- **Converting Categories to Numeric**: Convert categorical data to numeric for machine learning.


In [315]:
df['gender'] = df['gender'].astype('category')
# df.info()

In [314]:
df['gender'].dtype
# df.info()

CategoricalDtype(categories=['female', 'male'], ordered=False, categories_dtype=object)

In [318]:
df['gender'].value_counts().index

CategoricalIndex(['male', 'female'], categories=['female', 'male'], ordered=False, dtype='category', name='gender')

In [321]:
df['gender'].unique()

['male', 'female']
Categories (2, object): ['female', 'male']

In [323]:
df['gender'].nunique()

2

In [327]:
df['gender'].cat.codes

gender = pd.get_dummies(df['gender']).astype('int')
# gender

In [333]:
df['gender'].apply(lambda x: 'm' if x=='male' else 'f')
# df['gender'].replace({'male': 'm', 'female': 'f'})

df['gender'].cat.rename_categories({'male': 'm', 'female': 'f'})

df['gender'].map({'male': 'm', 'female': 'f'})

0      m
1      f
2      f
3      f
4      m
      ..
886    m
887    f
888    f
889    m
890    m
Name: gender, Length: 891, dtype: category
Categories (2, object): ['f', 'm']
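
For machine learning, a categorical column usually ends up as integer codes or as one-hot columns. A short sketch of both encodings, assuming a hypothetical four-value gender series:

```python
import pandas as pd

gender = pd.Series(["male", "female", "female", "male"], dtype="category")

codes = gender.cat.codes                        # one integer per category
dummies = pd.get_dummies(gender).astype("int")  # one 0/1 column per category

print(list(gender.cat.categories))
print(codes.tolist())
print(dummies.columns.tolist())
```

Integer codes imply an ordering between categories, so one-hot columns are generally the safer input for models that treat features as numeric magnitudes.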

### 13. Handling Dates
- **Working with Date Data**: Learn how to handle date and time data in Pandas.
- **Date-based Indexing and Resampling**: Perform operations based on dates.


In [341]:
df = pd.read_csv('https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/jamesbond.csv')
df.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,

df['Year'] = pd.to_datetime(df['Year'], format='%Y')


In [351]:
df['Year'].dt.year
df['Year'].dt.month

df['Year'].dt.month_name()
df['Year'].dt.day

pd.to_datetime('now').year
pd.to_datetime('now').month_name()
pd.to_datetime('now').day_name()

'Thursday'

In [358]:
df[df['Year']>pd.to_datetime('2000')]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
21,Die Another Day,2002-01-01,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006-01-01,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008-01-01,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012-01-01,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015-01-01,Daniel Craig,Sam Mendes,726.7,206.3,


In [362]:
pd.date_range(start='2000', end='2002', freq='ME')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31',
               '2001-01-31', '2001-02-28', '2001-03-31', '2001-04-30',
               '2001-05-31', '2001-06-30', '2001-07-31', '2001-08-31',
               '2001-09-30', '2001-10-31', '2001-11-30', '2001-12-31'],
              dtype='datetime64[ns]', freq='ME')
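
Date-based indexing and resampling both require a `DatetimeIndex`. A minimal sketch, assuming a made-up daily series covering January and February 2000; the `"MS"` (month start) frequency is used here only because it labels each monthly bucket by its first day:

```python
import pandas as pd

# Hypothetical daily values for January and February 2000.
days = pd.date_range(start="2000-01-01", end="2000-02-29", freq="D")
daily = pd.Series(1, index=days)

# Date-based indexing: a partial date string selects the whole month.
january = daily.loc["2000-01"]

# Resampling: bucket the daily values by calendar month and sum each bucket.
monthly = daily.resample("MS").sum()

print(len(january))
print(monthly.tolist())
```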