# Pandas
---
- **Pandas Series**: a **_one_**-dimensional data structure that carries index information.
- **Pandas DataFrame**: a **_two_**-dimensional (rows and columns) data structure that carries index information.

In [1]:
import pandas as pd
s = pd.Series([10, 77, 12, 4, 5])  # convert a list to a pandas Series.
s

0    10
1    77
2    12
3     4
4     5
dtype: int64

In [2]:
type(s)

pandas.core.series.Series

In [3]:
s.index  # the index labels.

RangeIndex(start=0, stop=5, step=1)

In [4]:
s.dtype  # data type of the values in the Series.

dtype('int64')

In [5]:
s.size  # number of elements in the Series.

5

In [6]:
s.ndim  # number of dimensions; a pandas Series is one-dimensional.

1

In [7]:
s.values  # the values stored in the Series.

array([10, 77, 12,  4,  5])

- When we access the data with `values`, pandas returns a _NumPy array_. By asking for `values`, we are effectively saying _"we only care about the values; the index does not matter"_.
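
To make the separation between values and index concrete, here is a minimal sketch. It assumes a Series with hypothetical string labels (`"a"` to `"e"`, not part of the original example) and uses `to_numpy()`, which returns the same data as `values`.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, chosen only to make the index visible.
s = pd.Series([10, 77, 12, 4, 5], index=["a", "b", "c", "d", "e"])

values = s.to_numpy()        # the raw data as a NumPy array, without the index
labels = s.index.to_list()   # the labels are stored separately on the index

print(type(values))
print(values)
print(labels)
```

The array carries no label information, so anything that depends on the index has to be done before dropping down to NumPy.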

In [8]:
s.head()  # returns the first 5 values.

0    10
1    77
2    12
3     4
4     5
dtype: int64

In [9]:
s.tail()  # returns the last 5 values.

0    10
1    77
2    12
3     4
4     5
dtype: int64

### Reading Data

In [10]:
df = pd.read_csv("datasets/advertising.csv")
df.head()

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


### Selection in Pandas

In [11]:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [12]:
df.index

RangeIndex(start=0, stop=891, step=1)

In [13]:
df[0:13]  # slice rows from 0 (inclusive) to 13 (exclusive).

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


In [15]:
df.drop(0, axis=0).head()  # drop an observation (row) by its index label.

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True


In [14]:
delete_indexes = [1, 3, 5, 7]
df.drop(delete_indexes, axis=0).head(10)  # drop several rows at once with fancy indexing (a list of labels).

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
10,1,3,female,4.0,1,1,16.7,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True
12,0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
13,0,3,male,39.0,1,5,31.275,S,Third,man,True,,Southampton,no,False


- To make these changes permanent, the result can be assigned back to the DataFrame:
- **df = df.drop(delete_indexes, axis=0)**
- Alternatively, the **inplace=True** argument can be passed:
- **df.drop(delete_indexes, axis=0, inplace=True)**
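
The difference between a temporary and a persistent drop can be sketched on a small frame. `df_demo` and its column `x` are hypothetical names used only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

# drop() returns a new DataFrame; df_demo itself is not modified.
dropped = df_demo.drop([1, 3], axis=0)
rows_before = len(df_demo)   # still 5
rows_dropped = len(dropped)  # 3

# Assigning the result back makes the change persistent.
df_demo = df_demo.drop([1, 3], axis=0)
print(rows_before, rows_dropped, df_demo.index.tolist())
```

Reassignment and `inplace=True` end in the same state; reassignment is often preferred because it keeps the operation chainable.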

### Converting a Variable to the Index
- We can select a variable (column) in two different ways.

In [16]:
df["age"].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [17]:
df.age.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [18]:
df.index = df["age"]  # use the "age" variable as the index.

In [19]:
df.drop("age", axis=1).head()  # remove "age" from the columns (returns a new DataFrame; df itself is unchanged).

age,survived,pclass,sex,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
22.0,0,3,male,1,0,7.25,S,Third,man,True,,Southampton,no,False
38.0,1,1,female,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
26.0,1,3,female,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
35.0,1,1,female,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
35.0,0,3,male,0,0,8.05,S,Third,man,True,,Southampton,no,True


### Converting the Index to a Variable

In [20]:
df.index

Float64Index([22.0, 38.0, 26.0, 35.0, 35.0,  nan, 54.0,  2.0, 27.0, 14.0,
              ...
              33.0, 22.0, 28.0, 25.0, 39.0, 27.0, 19.0,  nan, 26.0, 32.0],
             dtype='float64', name='age', length=891)

In [21]:
df["age"] = df.index

In [22]:
df.head()

age,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
22.0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
38.0,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
26.0,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
35.0,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
35.0,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [23]:
df.drop("age", axis=1, inplace=True)

In [24]:
df.reset_index().head()

Unnamed: 0,age,survived,pclass,sex,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,22.0,0,3,male,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,38.0,1,1,female,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,26.0,1,3,female,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,35.0,1,1,female,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,35.0,0,3,male,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [25]:
df = df.reset_index()

In [26]:
df.head()

Unnamed: 0,age,survived,pclass,sex,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,22.0,0,3,male,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,38.0,1,1,female,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,26.0,1,3,female,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,35.0,1,1,female,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,35.0,0,3,male,0,0,8.05,S,Third,man,True,,Southampton,no,True


- We used two ways to convert the index into a variable.
    - Assigning to a column name: `df["age"] = df.index` writes the index values into the `age` column, creating the column if it does not already exist.
    - The **reset_index** method removes the index and adds the removed index back as a column.
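
The round trip between a column and the index can also be written with `set_index` and `reset_index`. The sketch below uses a hypothetical three-row frame (`people`) rather than the Titanic data.

```python
import pandas as pd

# Hypothetical data, for illustration only.
people = pd.DataFrame({"name": ["ann", "bob", "cem"],
                       "age": [22.0, 38.0, 26.0]})

indexed = people.set_index("age")   # "age" moves from the columns to the index
restored = indexed.reset_index()    # the index moves back as the first column

print(indexed.columns.tolist())
print(restored.columns.tolist())
```

`set_index` combines the two manual steps shown earlier (assigning `df.index` and dropping the column) into one call.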

### Operations on Variables

In [27]:
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [28]:
"age" in df  # check whether the "age" variable exists in the DataFrame.

True

In [29]:
df["age"].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [30]:
type(df["age"].head())  # single square brackets return a pandas Series.

pandas.core.series.Series

In [31]:
type(df[["age"]].head())  # use double square brackets to get a pandas DataFrame back.

pandas.core.frame.DataFrame

In [33]:
df[["age", "alive"]].head()

Unnamed: 0,age,alive
0,22.0,no
1,38.0,yes
2,26.0,yes
3,35.0,yes
4,35.0,no


In [34]:
col_names = ["age", "adult_male", "alive"]  # select the variables we want with fancy indexing (a list of names).
df[col_names].head()

Unnamed: 0,age,adult_male,alive
0,22.0,True,no
1,38.0,False,yes
2,26.0,False,yes
3,35.0,False,yes
4,35.0,True,no


In [35]:
df["age2"] = df["age"]**2
df["age3"] = df["age"] / df["age2"]
# we added new variables to the dataset.

In [36]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age2,age3
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,484.0,0.045455
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1444.0,0.026316
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,676.0,0.038462
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,1225.0,0.028571
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1225.0,0.028571


In [37]:
df.drop("age3", axis=1).head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age2
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,484.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1444.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,676.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,1225.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1225.0


In [38]:
df.drop(col_names, axis=1).head()

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embarked,class,who,deck,embark_town,alone,age2,age3
0,0,3,male,1,0,7.25,S,Third,man,,Southampton,False,484.0,0.045455
1,1,1,female,1,0,71.2833,C,First,woman,C,Cherbourg,False,1444.0,0.026316
2,1,3,female,0,0,7.925,S,Third,woman,,Southampton,True,676.0,0.038462
3,1,1,female,1,0,53.1,S,First,woman,C,Southampton,False,1225.0,0.028571
4,0,3,male,0,0,8.05,S,Third,man,,Southampton,True,1225.0,0.028571


In [39]:
df.loc[:, ~df.columns.str.contains("age")].head() 

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,0,0,8.05,S,Third,man,True,,Southampton,no,True


- With `loc`, we displayed every variable except those whose name contains "age".
- The `~` operator negates the boolean mask; it can be read as _"not"_.
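
A small sketch of how the mask and its negation behave, assuming a hypothetical one-row frame `df_demo` with two "age" columns and two others:

```python
import pandas as pd

# Hypothetical one-row frame, for illustration only.
df_demo = pd.DataFrame({"age": [22], "age2": [484],
                        "fare": [7.25], "sex": ["male"]})

has_age = df_demo.columns.str.contains("age")  # one boolean per column

with_age = df_demo.loc[:, has_age]       # columns whose name contains "age"
without_age = df_demo.loc[:, ~has_age]   # "~" flips every True/False in the mask

print(with_age.columns.tolist())
print(without_age.columns.tolist())
```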

### loc & iloc
---
- Used to select rows and columns from a dataset.
- **loc**: _label-based selection_; selects by **label**. Slices include the end label, so `df.loc[0:3]` returns four rows.
- **iloc**: _integer-based selection_; selects by **integer position**. Slices exclude the end position, so `df.iloc[0:3]` returns three rows.
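
The distinction is easiest to see when the index labels differ from the positions. The sketch below assumes a hypothetical frame whose labels are 10, 20, 30, 40.

```python
import pandas as pd

# Labels deliberately differ from positions (hypothetical data).
df_demo = pd.DataFrame({"v": [1, 2, 3, 4]}, index=[10, 20, 30, 40])

by_position = df_demo.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = df_demo.loc[10:30]    # labels 10, 20 and 30; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```

With the default `RangeIndex`, labels and positions coincide, which is why `loc[0:3]` and `iloc[0:3]` on the Titanic data differ only in the last row.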

In [40]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age2,age3
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,484.0,0.045455
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1444.0,0.026316
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,676.0,0.038462
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,1225.0,0.028571
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1225.0,0.028571


In [41]:
df.iloc[0:3]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age2,age3
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,484.0,0.045455
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1444.0,0.026316
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,676.0,0.038462


In [42]:
df.iloc[0, 0]

0

In [43]:
df.loc[0:3]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age2,age3
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,484.0,0.045455
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1444.0,0.026316
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,676.0,0.038462
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,1225.0,0.028571


In [44]:
df.iloc[0:3, 0:3]

Unnamed: 0,survived,pclass,sex
0,0,3,male
1,1,1,female
2,1,3,female


In [45]:
df.loc[0:3, "age"]

0    22.0
1    38.0
2    26.0
3    35.0
Name: age, dtype: float64

In [46]:
col_names = ["age", "embarked", "alive"]
df.loc[0:3, col_names]

Unnamed: 0,age,embarked,alive
0,22.0,S,no
1,38.0,C,yes
2,26.0,S,yes
3,35.0,S,yes


### Conditional Selection

In [47]:
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [48]:
df[df["age"] > 50].head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True
15,1,2,female,55.0,0,0,16.0,S,Second,woman,False,,Southampton,yes,True
33,0,2,male,66.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
54,0,1,male,65.0,0,1,61.9792,C,First,man,True,B,Cherbourg,no,False


- The answer to the question **how many people in the dataset are older than 50?**

In [49]:
df[df["age"] > 50]["age"].count()

64

- To inspect two variables under a given condition:

In [50]:
df.loc[df["age"] > 50, ["age", "class"]].head() 

Unnamed: 0,age,class
6,54.0,First
11,58.0,First
15,55.0,Second
33,66.0,Second
54,65.0,First


- We want to inspect men older than 50.
- When combining more than one condition, each condition must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators.

In [51]:
df.loc[(df["age"] > 50) & (df["sex"] == "male"), ["age", "class"]].head()

Unnamed: 0,age,class
6,54.0,First
33,66.0,Second
54,65.0,First
94,59.0,Third
96,71.0,First


In [52]:
df["embark_town"].value_counts()

Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

In [53]:
df_new = df.loc[(df["age"] > 50) & (df["sex"] == "male")
               & ((df["embark_town"] == "Cherbourg") | (df["embark_town"] == "Southampton")),
               ["age", "class", "embark_town"]]

In [54]:
df_new.head()

Unnamed: 0,age,class,embark_town
6,54.0,First,Southampton
33,66.0,Second,Southampton
54,65.0,First,Cherbourg
94,59.0,Third,Southampton
96,71.0,First,Cherbourg


In [55]:
df_new["embark_town"].value_counts()

Southampton    35
Cherbourg       9
Name: embark_town, dtype: int64
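
The chained `|` comparison on `embark_town` can also be expressed with `isin`, which tends to stay readable as the list of accepted values grows. This is an alternative sketch on a hypothetical five-row frame, not the notebook's own approach.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df_demo = pd.DataFrame({
    "age": [54.0, 66.0, 30.0, 71.0, 59.0],
    "sex": ["male", "male", "male", "female", "male"],
    "embark_town": ["Southampton", "Queenstown", "Cherbourg",
                    "Cherbourg", "Cherbourg"],
})

towns = ["Cherbourg", "Southampton"]

# Each comparison is parenthesized; isin() already returns a boolean Series.
mask = (
    (df_demo["age"] > 50)
    & (df_demo["sex"] == "male")
    & df_demo["embark_town"].isin(towns)
)

selected = df_demo.loc[mask, ["age", "embark_town"]]
print(selected)
```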

### Aggregation & Grouping
---
- count()
- first()
- last()
- mean()
- median()
- min()
- max()
- std()
- var()
- sum()
- pivot table



- **Grouping**: aggregation functions can be applied per group with the **groupby** method.

In [56]:
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [57]:
df["age"].mean()

29.69911764705882

In [58]:
df.groupby("sex")["age"].mean()

sex
female    27.915709
male      30.726645
Name: age, dtype: float64

- Mean age by sex:

In [59]:
df.groupby("sex").agg({"age": "mean"})

sex,age
female,27.915709
male,30.726645


In [60]:
df.groupby("sex").agg({"age": ["mean", "sum"]})

,age,age
sex,mean,sum
female,27.915709,7286.0
male,30.726645,13919.17


In [61]:
df.groupby("sex").agg({"age": ["mean", "sum"],
                      "survived": "mean"})

,age,age,survived
sex,mean,sum,mean
female,27.915709,7286.0,0.742038
male,30.726645,13919.17,0.188908


In [63]:
df.groupby(["sex", "embark_town", "class"]).agg({"age": ["mean"],
                                                "survived": ["mean"]})

,,,age,survived
sex,embark_town,class,mean,mean
female,Cherbourg,First,36.052632,0.976744
female,Cherbourg,Second,19.142857,1.0
female,Cherbourg,Third,14.0625,0.652174
female,Queenstown,First,33.0,1.0
female,Queenstown,Second,30.0,1.0
female,Queenstown,Third,22.85,0.727273
female,Southampton,First,32.704545,0.958333
female,Southampton,Second,29.719697,0.910448
female,Southampton,Third,23.223684,0.375
male,Cherbourg,First,40.111111,0.404762


- Mean age, survival rate, and passenger count broken down by three grouping variables:

In [64]:
df.groupby(["sex", "embark_town", "class"]).agg({
    "age": ["mean"],
    "survived": "mean",
    "sex": "count"
})

,,,age,survived,sex
sex,embark_town,class,mean,mean,count
female,Cherbourg,First,36.052632,0.976744,43
female,Cherbourg,Second,19.142857,1.0,7
female,Cherbourg,Third,14.0625,0.652174,23
female,Queenstown,First,33.0,1.0,1
female,Queenstown,Second,30.0,1.0,2
female,Queenstown,Third,22.85,0.727273,33
female,Southampton,First,32.704545,0.958333,48
female,Southampton,Second,29.719697,0.910448,67
female,Southampton,Third,23.223684,0.375,88
male,Cherbourg,First,40.111111,0.404762,42
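
The dictionary form of `agg` produces two-level column headers. If flat, self-describing column names are preferred, pandas also supports named aggregation. The sketch below is an alternative on a hypothetical four-row frame; the output names `mean_age`, `survival_rate` and `passengers` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df_demo = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "age": [20.0, 30.0, 40.0, 60.0],
    "survived": [1, 1, 0, 1],
})

# Named aggregation: output_name=(input_column, aggregation_function)
summary = df_demo.groupby("sex").agg(
    mean_age=("age", "mean"),
    survival_rate=("survived", "mean"),
    passengers=("survived", "count"),
)

print(summary)
```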


### Pivot Table
---
- A pivot table lets us evaluate the dataset across breakdowns and view a summary statistic for each combination of them.
- **pivot_table(_what do you want to see at the intersections?, which variable do you want on the rows?, which variable do you want on the columns?_)**
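
The three positional arguments correspond to the `values`, `index` and `columns` parameters. Spelling them out as keywords can make a call easier to read. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df_demo = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "embarked": ["C", "S", "C", "S"],
    "survived": [1, 0, 0, 1],
})

table = df_demo.pivot_table(
    values="survived",   # what to show at the intersections
    index="sex",         # variable on the rows
    columns="embarked",  # variable on the columns
    aggfunc="mean",      # summary statistic (mean is also the default)
)

print(table)
```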

In [2]:
import pandas as pd
import seaborn as sns
pd.set_option('display.max_columns', None)
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


- To see the `survived` information broken down by sex and `embarked`:

In [3]:
df.pivot_table("survived", "sex", "embarked")

embarked,C,Q,S
sex,,,
female,0.876712,0.75,0.689655
male,0.305263,0.073171,0.174603


- `pivot_table` aggregates with the mean by default. To compute standard deviations instead, change the **aggfunc** parameter.

In [4]:
df.pivot_table("survived", "sex", "embarked", aggfunc="std")

embarked,C,Q,S
sex,,,
female,0.331042,0.439155,0.463778
male,0.462962,0.263652,0.380058


In [5]:
df.pivot_table("survived", "sex", ["embarked", "class"])

embarked,C,C,C,Q,Q,Q,S,S,S
class,First,Second,Third,First,Second,Third,First,Second,Third
sex,,,,,,,,,
female,0.976744,1.0,0.652174,1.0,1.0,0.727273,0.958333,0.910448,0.375
male,0.404762,0.2,0.232558,0.0,0.0,0.076923,0.35443,0.154639,0.128302


In [6]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [7]:
df["new_age"] = pd.cut(df["age"], [0, 10, 18, 25, 40, 90])

- We use **cut** and **qcut** to convert numerical variables into categorical ones.
- If we know the boundaries by which the numerical variable should be categorized, we use **cut**; if we do not, we use **qcut**, which splits the values into quantile-based bins.
- **cut(what should I split?, where should I split it?)**
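
The contrast between the two functions can be sketched on a short Series. The ages, bin edges and label names below are hypothetical and chosen only to keep the result easy to follow.

```python
import pandas as pd

ages = pd.Series([5, 15, 22, 35, 60, 80])  # hypothetical values

# cut: we supply the bin edges ourselves; intervals are (0, 18], (18, 40], (40, 90].
by_edges = pd.cut(ages, bins=[0, 18, 40, 90],
                  labels=["child", "adult", "senior"])

# qcut: we only say how many bins we want; the edges come from the quantiles,
# so each bin receives roughly the same number of observations.
by_quantiles = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(by_edges.tolist())
print(by_quantiles.tolist())
```

With `cut` the group sizes depend on the data, while with `qcut` the group sizes are roughly equal and the edges depend on the data.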

In [8]:
df.pivot_table("survived", "sex", ["new_age", "class"])

new_age,"(0, 10]","(0, 10]","(0, 10]","(10, 18]","(10, 18]","(10, 18]","(18, 25]","(18, 25]","(18, 25]","(25, 40]","(25, 40]","(25, 40]","(40, 90]","(40, 90]","(40, 90]"
class,First,Second,Third,First,Second,Third,First,Second,Third,First,Second,Third,First,Second,Third
sex,,,,,,,,,,,,,,,
female,0.0,1.0,0.5,1.0,1.0,0.52381,0.941176,0.933333,0.5,1.0,0.90625,0.464286,0.961538,0.846154,0.111111
male,1.0,1.0,0.363636,0.666667,0.0,0.103448,0.333333,0.047619,0.115385,0.513514,0.071429,0.172043,0.28,0.095238,0.064516


### Apply & Lambda
---
- **apply**: runs a function automatically along rows or columns.
    - It lets us iterate over variables without writing a loop.
- **lambda**: a throwaway (anonymous) function.
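
As a quick illustration of "rows or columns", the sketch below applies one lambda per column (the default, `axis=0`) and another per row (`axis=1`). The frame `df_demo` and its columns `a` and `b` are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the lambda receives each column as a Series.
col_max = df_demo.apply(lambda col: col.max())

# axis=1: the lambda receives each row as a Series.
row_sum = df_demo.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_max.tolist())
print(row_sum.tolist())
```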

In [9]:
pd.set_option('display.max_columns', None)
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [10]:
df["age2"] = df["age"] * 2
df["age3"] = df["age"] * 5
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age2,age3
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,44.0,110.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,76.0,190.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,52.0,130.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,70.0,175.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,70.0,175.0


- We want to apply functional operations to the variables in the dataset. Doing this step by step:

In [11]:
(df["age"] / 10).head()

0    2.2
1    3.8
2    2.6
3    3.5
4    3.5
Name: age, dtype: float64

In [12]:
(df["age2"] / 10).head()

0    4.4
1    7.6
2    5.2
3    7.0
4    7.0
Name: age2, dtype: float64

In [13]:
(df["age3"] / 10).head()

0    11.0
1    19.0
2    13.0
3    17.5
4    17.5
Name: age3, dtype: float64

- As this shows, handling variables one at a time does not scale. With many variables, we would need to write a loop.

In [14]:
for col in df.columns:
    if "age" in col:
        print(col)

age
age2
age3


In [15]:
for col in df.columns:
    if "age" in col:
        print((df[col] / 10).head())

0    2.2
1    3.8
2    2.6
3    3.5
4    3.5
Name: age, dtype: float64
0    4.4
1    7.6
2    5.2
3    7.0
4    7.0
Name: age2, dtype: float64
0    11.0
1    19.0
2    13.0
3    17.5
4    17.5
Name: age3, dtype: float64


- The operation above was not saved to the dataset. Let's save it now.

In [16]:
for col in df.columns:
    if "age" in col:
        df[col] = (df[col] / 10)

df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age2,age3
0,0,3,male,2.2,1,0,7.25,S,Third,man,True,,Southampton,no,False,4.4,11.0
1,1,1,female,3.8,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,7.6,19.0
2,1,3,female,2.6,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,5.2,13.0
3,1,1,female,3.5,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,7.0,17.5
4,0,3,male,3.5,0,0,8.05,S,Third,man,True,,Southampton,no,True,7.0,17.5


In [17]:
df[["age", "age2", "age3"]].apply(lambda x: x / 10).head()

Unnamed: 0,age,age2,age3
0,0.22,0.44,1.1
1,0.38,0.76,1.9
2,0.26,0.52,1.3
3,0.35,0.7,1.75
4,0.35,0.7,1.75


In [18]:
df.loc[:, df.columns.str.contains("age")].apply(lambda x: x / 10).head()

Unnamed: 0,age,age2,age3
0,0.22,0.44,1.1
1,0.38,0.76,1.9
2,0.26,0.52,1.3
3,0.35,0.7,1.75
4,0.35,0.7,1.75


In [19]:
df.loc[:, df.columns.str.contains("age")].apply(lambda x: (x - x.mean()) / x.std()).head()

Unnamed: 0,age,age2,age3
0,-0.530005,-0.530005,-0.530005
1,0.57143,0.57143,0.57143
2,-0.254646,-0.254646,-0.254646
3,0.364911,0.364911,0.364911
4,0.364911,0.364911,0.364911


In [23]:
def standart_scaler(col_name):
    return (col_name - col_name.mean()) / col_name.std()

In [24]:
df.loc[:, df.columns.str.contains("age")].apply(standart_scaler).head()

Unnamed: 0,age,age2,age3
0,-0.530005,-0.530005,-0.530005
1,0.57143,0.57143,0.57143
2,-0.254646,-0.254646,-0.254646
3,0.364911,0.364911,0.364911
4,0.364911,0.364911,0.364911


### Join Operations
---
- **concat**: concatenates the two DataFrames we have.

In [25]:
import numpy as np
import pandas as pd
m = np.random.randint(1, 30, size=(5, 3))
df1 = pd.DataFrame(m, columns=["var1", "var2", "var3"])
df2 = df1 + 99

In [26]:
pd.concat([df1, df2])

Unnamed: 0,var1,var2,var3
0,24,20,7
1,25,22,9
2,19,22,16
3,19,5,12
4,18,28,2
0,123,119,106
1,124,121,108
2,118,121,115
3,118,104,111
4,117,127,101


- Here the two DataFrames were stacked on top of each other. Looking at the index, the labels 0 to 4 repeat, which we usually do not want. To fix this:

In [28]:
pd.concat([df1, df2], ignore_index=True)
# the "ignore_index" parameter renumbers the index from 0.

Unnamed: 0,var1,var2,var3
0,24,20,7
1,25,22,9
2,19,22,16
3,19,5,12
4,18,28,2
5,123,119,106
6,124,121,108
7,118,121,115
8,118,104,111
9,117,127,101


- Joining with **merge**:

In [29]:
df1 = pd.DataFrame({'employees': ["john", "dennis", "mark", "maria"],
                   "group": ["accounting", "engineering", "engineering", "hr"]})

In [30]:
df2 = pd.DataFrame({"employees": ["mark", "john", "dennis", "maria"],
                   "start_date": [2010, 2009, 2014, 2019]})

In [31]:
pd.merge(df1, df2)

Unnamed: 0,employees,group,start_date
0,john,accounting,2009
1,dennis,engineering,2014
2,mark,engineering,2010
3,maria,hr,2019


In [32]:
pd.merge(df1, df2, on="employees")  # we can specify which variable to join on.

Unnamed: 0,employees,group,start_date
0,john,accounting,2009
1,dennis,engineering,2014
2,mark,engineering,2010
3,maria,hr,2019


- **Goal**: we want to access each employee's manager.

In [33]:
df3 = pd.merge(df1, df2)

In [34]:
df4 = pd.DataFrame({"group": ["accounting", "engineering", "hr"],
                   "manager": ["Caner", "Mustafa", "Berkcan"]})

In [35]:
pd.merge(df3, df4)

Unnamed: 0,employees,group,start_date,manager
0,john,accounting,2009,Caner
1,dennis,engineering,2014,Mustafa
2,mark,engineering,2010,Mustafa
3,maria,hr,2019,Berkcan
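
In the example above every group has a manager, so no rows are lost. When that is not guaranteed, the `how` parameter controls what happens to unmatched rows. The sketch below assumes a hypothetical employee (`zoe`) in a hypothetical group (`sales`) that has no manager entry.

```python
import pandas as pd

# "zoe" and "sales" are hypothetical additions, for illustration only.
staff = pd.DataFrame({"employees": ["john", "mark", "zoe"],
                      "group": ["accounting", "engineering", "sales"]})
managers = pd.DataFrame({"group": ["accounting", "engineering"],
                         "manager": ["Caner", "Mustafa"]})

# Default how="inner": only rows with a matching group in both frames survive.
inner = pd.merge(staff, managers, on="group")

# how="left": every row of staff is kept; missing managers become NaN.
left = pd.merge(staff, managers, on="group", how="left")

print(inner)
print(left)
```

Passing `on` explicitly, as here, also guards against pandas silently joining on every shared column name.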
