Pandas is a popular Python library for data analysis and data manipulation.

It can read data from many different sources, such as CSV, Excel, and JSON files and SQL databases.

**Pandas Data Structures**

Series - One-dimensional data structure; each value carries an index label

DataFrame - Two-dimensional data structure; rows carry index labels and columns carry names

In [1]:
##### Importing Pandas Library #####

import pandas as pd

In [2]:
#### Creating Pandas Series and DataFrames ####

s = pd.Series([10, 77, 12, 4, 5])

dfz = pd.DataFrame([10, 77, 12, 4, 5])

In [3]:
### Basic Methods ###

# index:  index (row label) information
# dtype:  data type of the elements (for a DataFrame, use dtypes)
# size:   total number of elements
# ndim:   number of dimensions (1 for a Series, 2 for a DataFrame)
# head(): first 5 rows by default
# tail(): last 5 rows by default
# shape:  size of each dimension (rows & columns for 2D)
# values: the values as a NumPy array


## Examples:

s.index
dfz.index
s.head()
dfz.tail()

    0
0  10
1  77
2  12
3   4
4   5
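The attributes and methods listed above can be tried on the same small Series and DataFrame. This is a minimal sketch; the variable names `first_two` and `last_two` are introduced here for illustration only.

```python
import pandas as pd

s = pd.Series([10, 77, 12, 4, 5])
dfz = pd.DataFrame([10, 77, 12, 4, 5])

# Attributes are read without parentheses.
print(s.size)             # 5
print(s.ndim)             # 1
print(dfz.ndim)           # 2
print(s.shape)            # (5,)
print(dfz.shape)          # (5, 1)
print(s.values.tolist())  # [10, 77, 12, 4, 5]

# head() and tail() are methods; the optional argument sets the row count.
first_two = s.head(2)
last_two = dfz.tail(2)
print(first_two.tolist())     # [10, 77]
print(last_two[0].tolist())   # [4, 5]
```

Note that a DataFrame built from a flat list gets a single column labelled `0`, which is why `dfz.shape` is `(5, 1)`.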


**Importing dataset from Seaborn to Pandas DataFrame**

In [4]:
import pandas as pd
import seaborn as sns

#df = sns.load_dataset("titanic")
df = pd.read_csv("/kaggle/input/titanic-dataset-csv/titanic.csv")

df.head(2)

df.tail(1)

df.shape  ##output: (891, 15) -- 891 rows and 15 columns

df.info()

df.columns

df.index  ##output: RangeIndex(start=0, stop=891, step=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


RangeIndex(start=0, stop=891, step=1)

**Selection in Pandas**

In [5]:
import pandas as pd
import seaborn as sns

#df = sns.load_dataset("titanic")
df = pd.read_csv("/kaggle/input/titanic-dataset-csv/titanic.csv")

df[0:13] ## slicing the first 13 rows

df[["age"]] ## selecting the "age" column as a one-column DataFrame

df[["age", "alive"]] ## selecting more than one column

col_names = ["age", "adult_male", "alive"]
df[col_names]     ## selecting multiple columns from a list of names

df["age_square"] = df["age"]**2  ## adding a new column
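Single and double brackets return different types, which is easy to miss in the selections above. The sketch below uses a small hypothetical DataFrame named `toy` in place of the Titanic data, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical toy data standing in for two Titanic columns.
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0], "alive": ["no", "yes", "yes"]})

age_series = toy["age"]     # single brackets -> Series
age_frame = toy[["age"]]    # double brackets -> one-column DataFrame

print(type(age_series).__name__)  # Series
print(type(age_frame).__name__)   # DataFrame
print(age_frame.shape)            # (3, 1)

# Assigning to a new column name adds the column in place.
toy["age_square"] = toy["age"] ** 2
print(toy["age_square"].tolist())  # [484.0, 1444.0, 676.0]
```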

**Deleting Rows - Columns**

In [6]:
#df.drop(0, axis=0).head()             ## 0 ---> index zero , axis=0 ---> Row

#delete_indexes = [1, 3, 5, 7]
#df.drop(delete_indexes, axis=0).head(10)     ## deletes index 1, 3, 5, and 7


## To make the change permanent (modify the existing DataFrame 'df'):
#df = df.drop(delete_indexes, axis=0)               ##OR
#df.drop(delete_indexes, axis=0, inplace=True)

##deleting columns

#df.drop("age", axis=1)                      # deletes column 'age'

#col_names = ["age", "adult_male", "alive"]
#df.drop(col_names, axis=1)      # deletes 3 columns
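By default `drop` returns a new DataFrame and leaves the original untouched, which is why the cell above needs reassignment or `inplace=True`. A minimal sketch on a hypothetical `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy data; not the Titanic dataset.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "alive": ["no", "yes", "yes", "yes"],
})

# Dropping rows by index label returns a copy.
dropped = toy.drop([1, 3], axis=0)
print(toy.shape)               # (4, 2)  original is unchanged
print(dropped.index.tolist())  # [0, 2]

# columns=... is equivalent to passing the labels with axis=1.
no_age = toy.drop(columns=["age"])
print(no_age.columns.tolist())  # ['alive']
```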

**Converting Variable to an Index**

In [7]:
df.index = df["age"]
## The "age" values are now the index of the DataFrame; the "age" column itself is still present

#df.drop("age", axis=1)
## returns a copy without the "age" column; df itself is unchanged

#df.drop("age", axis=1, inplace=True)
## drops the "age" column from df itself

**Converting Index to a Variable**

In [8]:
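df["age"] = df.index   ## copy the index values back into an "age" column
### OR ###
#df.reset_index()      ## returns a copy with the index moved into a column; raises an error if an "age" column already exists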
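The two conversions above can also be written with `set_index` and `reset_index`, which move a column into the index and back in one step each. This is a sketch on a hypothetical `toy` DataFrame, offered as an alternative rather than as the notebook's own approach.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic dataset.
toy = pd.DataFrame({"age": [22.0, 38.0], "sex": ["male", "female"]})

# set_index moves "age" out of the columns and into the index.
indexed = toy.set_index("age")
print(indexed.columns.tolist())  # ['sex']
print(indexed.index.name)        # age

# reset_index moves the index back into a column named "age".
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['age', 'sex']
print(restored.index.tolist())    # [0, 1]
```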

**Loc and iloc Methods**

In [9]:
df = pd.read_csv("/kaggle/input/titanic-dataset-csv/titanic.csv")

df.iloc[0, 0]  ## first row, first column (position-based)

df.iloc[0:2]   ## first and second rows (end position excluded)

df.loc[0:1, 'age'] ## 'age' values for index labels 0 and 1 (label-based)

print(df.loc[0:2])    ## first, second, and third rows (end label included)





   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
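The difference between `iloc` and `loc` is easier to see when the index labels are not 0, 1, 2. The sketch below uses a hypothetical `toy` DataFrame with labels 10 to 40: `iloc` slices by position and excludes the end, while `loc` slices by label and includes it.

```python
import pandas as pd

# Hypothetical toy data with a non-default index.
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0]}, index=[10, 20, 30, 40])

by_position = toy.iloc[0:2]   # positions 0 and 1; end excluded
by_label = toy.loc[10:30]     # labels 10 through 30; end included

print(by_position.index.tolist())  # [10, 20]
print(by_label.index.tolist())     # [10, 20, 30]

print(toy.iloc[0, 0])       # 22.0  (first row, first column)
print(toy.loc[20, "age"])   # 38.0  (row labelled 20, column "age")
```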


**Conditional Selection**

In [10]:
df[df['age'] > 50] ##select rows where the age > 50

df.loc[df['age'] > 50, 'class'] ##select class information

df.loc[df['age'] > 50, ['age', 'class']] ##select age and class information

df.loc[(df['age'] > 50) & (df['sex'] == 'male'), ['age', 'class']] ## and (&) condition

      age   class
6    54.0   First
33   66.0  Second
54   65.0   First
94   59.0   Third
96   71.0   First
116  70.5   Third
124  54.0   First
150  51.0  Second
152  55.5   Third
155  51.0   First
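Each condition produces a boolean Series, and conditions are combined with `&` (and) or `|` (or). Every condition needs its own parentheses when written inline, because `&` and `|` bind more tightly than comparisons. A sketch on a hypothetical `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy data; not the Titanic dataset.
toy = pd.DataFrame({
    "age": [54.0, 22.0, 66.0, 58.0],
    "sex": ["male", "male", "male", "female"],
    "class": ["First", "Third", "Second", "First"],
})

older = toy["age"] > 50        # boolean Series: True, False, True, True
male = toy["sex"] == "male"    # boolean Series: True, True, True, False

both = toy.loc[older & male, ["age", "class"]]   # rows matching both conditions
either = toy.loc[older | male]                   # rows matching at least one

print(both.index.tolist())  # [0, 2]
print(len(either))          # 4
```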


**Aggregation & Grouping**

In [11]:
###Group by

df["age"].mean() ##output: 29.69911764705882 -- mean age over all passengers

df.groupby("sex")["age"].mean() ## mean age for each sex



###aggregation

df.groupby("sex").agg({"age" : "mean"})




##Group by two levels

df.groupby(["sex", "embark_town"]).agg({"age" : ["mean"], "survived": "mean"})



                          age  survived
                         mean      mean
sex    embark_town
female Cherbourg    28.344262  0.876712
       Queenstown   24.291667  0.750000
       Southampton  27.771505  0.689655
male   Cherbourg    32.998841  0.305263
       Queenstown   30.937500  0.073171
       Southampton  30.291440  0.174603
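The same grouping logic can be checked by hand on a few rows. The sketch below uses a hypothetical `toy` DataFrame and also shows named aggregation, an alternative to the dictionary form that yields flat column names; the names `mean_age` and `survival_rate` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic dataset.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": [20.0, 30.0, 40.0, 50.0],
    "survived": [0, 1, 0, 1],
})

# One aggregate per group: returns a Series indexed by "sex".
mean_age = toy.groupby("sex")["age"].mean()
print(mean_age["female"])  # 40.0
print(mean_age["male"])    # 30.0

# Named aggregation: output_name=(column, function).
summary = toy.groupby("sex").agg(
    mean_age=("age", "mean"),
    survival_rate=("survived", "mean"),
)
print(summary.columns.tolist())  # ['mean_age', 'survival_rate']
```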


**Pivot Table**

In [12]:
### Pivot table (similar to groupby; the default aggregation is the mean)

df.pivot_table("survived", "sex", ["embarked", "class"])

# survived ---> values argument
# sex ------> index (rows)
# embarked, class ------> columns


# Cut function

df["new_age"] = pd.cut(df["age"], [0, 10, 18, 25, 40, 90]) ## bin ages into intervals such as (0, 10] and (10, 18]

df.head()

df.pivot_table("survived", "sex", ["new_age", "class"]) ## use new_age and class as the column levels



new_age,"(0, 10]","(0, 10]","(0, 10]","(10, 18]","(10, 18]","(10, 18]","(18, 25]","(18, 25]","(18, 25]","(25, 40]","(25, 40]","(25, 40]","(40, 90]","(40, 90]","(40, 90]"
class,First,Second,Third,First,Second,Third,First,Second,Third,First,Second,Third,First,Second,Third
sex,,,,,,,,,,,,,,,
female,0.0,1.0,0.5,1.0,1.0,0.52381,0.941176,0.933333,0.5,1.0,0.90625,0.464286,0.961538,0.846154,0.111111
male,1.0,1.0,0.363636,0.666667,0.0,0.103448,0.333333,0.047619,0.115385,0.513514,0.071429,0.172043,0.28,0.095238,0.064516
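A smaller version of the binning and pivoting steps makes the result easy to verify by hand. This sketch uses a hypothetical `toy` DataFrame, two bins instead of five, and keyword arguments instead of positional ones; the column name `age_band` and its labels are assumptions made for the example. `observed=False` is passed explicitly because the binned column is categorical.

```python
import pandas as pd

# Hypothetical toy data; not the Titanic dataset.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": [5.0, 30.0, 30.0, 8.0],
    "survived": [1, 1, 0, 0],
})

# Bins are right-inclusive: (0, 10] -> "child", (10, 90] -> "adult".
toy["age_band"] = pd.cut(toy["age"], bins=[0, 10, 90], labels=["child", "adult"])
print(toy["age_band"].tolist())  # ['child', 'adult', 'adult', 'child']

# Mean survival per sex (rows) and age band (columns).
table = toy.pivot_table(
    values="survived",
    index="sex",
    columns="age_band",
    aggfunc="mean",
    observed=False,
)
print(table.loc["male", "child"])    # 1.0
print(table.loc["female", "child"])  # 0.0
```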


**Apply & Lambda**

In [13]:
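## with a named function

#def age_funct(age):
#    return 1 if age < 30 else 0
#df['age_flag'] = df['age'].apply(age_funct)

## with apply and lambda
## Note: missing ages (NaN) get flag 0, because NaN < 30 evaluates to False

df['age_flag'] = df['age'].apply(lambda age: 1 if age < 30 else 0)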
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,new_age,age_flag
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,"(18, 25]",1
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,"(25, 40]",0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,"(25, 40]",1
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,"(25, 40]",0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,"(25, 40]",0
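For a simple threshold like this one, a vectorized comparison gives the same flags as `apply` without calling a Python function per row. The sketch below is an alternative, not the notebook's method, and it also shows one way to keep missing ages as missing instead of flagging them 0. The Series `ages` and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([22.0, 38.0, np.nan])

# apply + lambda: NaN < 30 is False, so the missing age gets 0.
flag_apply = ages.apply(lambda age: 1 if age < 30 else 0)
print(flag_apply.tolist())   # [1, 0, 0]

# Vectorized equivalent: compare the whole Series, then cast to int.
flag_vector = (ages < 30).astype(int)
print(flag_vector.tolist())  # [1, 0, 0]

# Optional: keep missing ages as NaN in the flag column.
flag_keep_missing = (ages < 30).astype(float).where(ages.notna())
print(flag_keep_missing.isna().tolist())  # [False, False, True]
```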


**JOIN**

In [14]:
##Concat

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])

df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

pd.concat([df1, df2], ignore_index=True)

  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
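The effect of `ignore_index=True` is clearest when compared with the default. Without it, each input keeps its own index labels, so the result contains duplicates. The names `stacked` and `renumbered` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

# Default: original index labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True: the result gets a fresh 0..n-1 index.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```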


In [15]:
##Merge

df1 = pd.DataFrame({'employees': ['john', 'dennis', 'mark', 'maria'], 'group': ['accounting', 'engineering', 'engineering', 'hr']})

df2 = pd.DataFrame({'employees': ['john', 'dennis', 'mark', 'maria'], 'start_date': [2010, 2009, 2014, 2019]})

pd.merge(df1, df2)  ## joins on the shared 'employees' column (inner join by default)

  employees        group  start_date
0      john   accounting        2010
1    dennis  engineering        2009
2      mark  engineering        2014
3     maria           hr        2019


In [16]:
##Merge with another dataframe(df4)

df3 = pd.merge(df1, df2)

df4 = pd.DataFrame({'group': ['accounting', 'engineering', 'hr'], 'manager': ['Caner', 'Mustafa', 'Berkcan']})

pd.merge(df3, df4)  ## joins on the shared 'group' column; each manager is repeated for every employee in the group

  employees        group  start_date  manager
0      john   accounting        2010    Caner
1    dennis  engineering        2009  Mustafa
2      mark  engineering        2014  Mustafa
3     maria           hr        2019  Berkcan
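In the merges above every key has a match, so the join type does not matter. The sketch below shows what happens when keys do not line up, using the `on` and `how` parameters of `pd.merge`. The DataFrames `staff` and `managers`, the employee "ali", and the "sales" group are hypothetical additions for the example.

```python
import pandas as pd

# Hypothetical data: "sales" has no manager, and "hr" has no employee.
staff = pd.DataFrame({
    "employees": ["john", "dennis", "ali"],
    "group": ["accounting", "engineering", "sales"],
})
managers = pd.DataFrame({
    "group": ["accounting", "engineering", "hr"],
    "manager": ["Caner", "Mustafa", "Berkcan"],
})

# Inner join (the default): only groups present in both inputs survive.
inner = pd.merge(staff, managers, on="group")
print(inner["employees"].tolist())  # ['john', 'dennis']

# Left join: every row of staff is kept; unmatched managers become NaN.
left = pd.merge(staff, managers, on="group", how="left")
print(left["manager"].isna().tolist())  # [False, False, True]
```

Naming the key with `on` makes the join column explicit instead of relying on pandas to detect the shared column names.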
