# Pandas

This notebook was created by Eda AYDIN while following the Udemy course by DATAI Team.

## Introduction to Pandas

- It handles dataframes quickly and efficiently.
- Reading and writing file formats such as CSV and plain text is straightforward.
- Pandas makes it easy to deal with missing data.
- Data can be reshaped (for example with pivot() or melt()) so that it can be used more effectively.
- Slicing and indexing are simple.
- Pandas is very helpful for time series analysis.
- **Most importantly, Pandas is a fast, optimized library.**
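As a small illustration of the missing-data point above, the sketch below uses a made-up frame called scores (an assumption for illustration only, not part of this notebook's data) and three common pandas calls: isna(), fillna() and dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column (illustration only).
scores = pd.DataFrame({
    "Math": [90.0, np.nan, 70.0],
    "Physics": [np.nan, 80.0, 60.0],
})

# Count the missing values in each column.
missing_per_column = scores.isna().sum()

# Replace each missing value with the mean of its column.
filled = scores.fillna(scores.mean())

# Alternatively, drop every row that contains a missing value.
complete_rows = scores.dropna()
```

Whether filling or dropping is appropriate depends on the data; the column mean is only one possible choice.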

In [1]:
import pandas as pd

In [2]:
dictionary = {
    "Name": ["Laura","Marta","Micheal","Alex"],
    "Age":[32,45,30,35],
    "Salary":[15000,35000,25000,36000]
}

### pandas.DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
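The label alignment mentioned above can be sketched with two tiny frames. The names left and right and their values are assumptions chosen for illustration.

```python
import pandas as pd

# Two small frames with partially overlapping row labels (hypothetical data).
left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10, 20, 30]}, index=["b", "c", "d"])

# Addition pairs values by label, not by position.
total = left + right

# Labels that exist in only one frame ("a" and "d") end up as NaN.
unmatched = total["x"].isna().sum()
```

Here "b" is computed as 2 + 10 and "c" as 3 + 20, even though those values sit at different positions in the two frames.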

In [3]:
df = pd.DataFrame(dictionary)

In [4]:
df

|   | Name | Age | Salary |
|---|---|---|---|
| 0 | Laura | 32 | 15000 |
| 1 | Marta | 45 | 35000 |
| 2 | Micheal | 30 | 25000 |
| 3 | Alex | 35 | 36000 |


### pandas.DataFrame.head

Return the first n rows (n: int, default 5).

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
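A minimal sketch of the negative-n behaviour, using a small stand-in frame named people (an assumption, kept separate from df so the example runs on its own):

```python
import pandas as pd

# Stand-in frame for illustration.
people = pd.DataFrame({
    "Name": ["Laura", "Marta", "Micheal", "Alex"],
    "Age": [32, 45, 30, 35],
})

# head(-1) keeps every row except the last one ...
all_but_last = people.head(-1)

# ... which is the same result as slicing with people[:-1].
same_as_slice = all_but_last.equals(people[:-1])
```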

In [5]:
df.head(2)

|   | Name | Age | Salary |
|---|---|---|---|
| 0 | Laura | 32 | 15000 |
| 1 | Marta | 45 | 35000 |


### pandas.DataFrame.tail

Return the last n rows (n: int, default 5).

This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

For negative values of n, this function returns all rows except the first n rows, equivalent to df[n:].
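The same idea for tail can be sketched as follows; the frame people is again a stand-in introduced only for this example.

```python
import pandas as pd

# Stand-in frame for illustration.
people = pd.DataFrame({
    "Name": ["Laura", "Marta", "Micheal", "Alex"],
    "Age": [32, 45, 30, 35],
})

# tail(-1) keeps every row except the first one ...
all_but_first = people.tail(-1)

# ... which is the same result as slicing with people[1:].
same_as_slice = all_but_first.equals(people[1:])
```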

In [6]:
df.tail(2)

|   | Name | Age | Salary |
|---|---|---|---|
| 2 | Micheal | 30 | 25000 |
| 3 | Alex | 35 | 36000 |


## Pandas Basic Methods

### pandas.DataFrame.info

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   Salary  4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 224.0+ bytes


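- **4 non-null:** All 4 entries in the column are non-null, i.e. the column has no missing values.
- **object:** The dtype pandas uses for arbitrary Python objects; here the column holds strings.
- **dtypes:** A summary of the data types in the DataFrame and how many columns have each one.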

### pandas.DataFrame.describe

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided.

In [8]:
df.describe()

|   | Age | Salary |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 35.5 | 27750.0 |
| std | 6.658328 | 9844.626284 |
| min | 30.0 | 15000.0 |
| 25% | 31.5 | 22500.0 |
| 50% | 33.5 | 30000.0 |
| 75% | 37.5 | 35250.0 |
| max | 45.0 | 36000.0 |


By default, describe() summarizes only the columns with **numerical data** when the DataFrame has mixed types. Passing include="all" also summarizes the non-numeric columns.
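The include and exclude parameters of describe() control which columns are summarized. The sketch below uses a stand-in frame named people (an assumption for illustration) with one numeric and two text columns.

```python
import pandas as pd

# Stand-in frame with mixed column types.
people = pd.DataFrame({
    "Name": ["Laura", "Marta", "Micheal", "Alex"],
    "Age": [32, 45, 30, 35],
    "Gender": ["Female", "Female", "Male", "Male"],
})

# Default: only the numeric column (Age) is summarized.
numeric_summary = people.describe()

# Excluding numeric columns summarizes the text columns instead
# (count, unique, top, freq).
text_summary = people.describe(exclude="number")

# include="all" summarizes every column; statistics that do not apply
# to a column are shown as NaN.
full_summary = people.describe(include="all")
```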

## Indexing and Slicing Data

In [9]:
df["Name"]

0      Laura
1      Marta
2    Micheal
3       Alex
Name: Name, dtype: object

In [10]:
df["Gender"] = ["Female","Female","Male","Male"]

In [11]:
df

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female |
| 1 | Marta | 45 | 35000 | Female |
| 2 | Micheal | 30 | 25000 | Male |
| 3 | Alex | 35 | 36000 | Male |


### pandas.DataFrame.loc

Access a group of rows and columns by label(s) or a boolean array.

- loc[] is primarily label based, but may also be used with a boolean array.

Allowed inputs are:
- A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f'.

***Warning: Note that contrary to usual python slices, both the start and the stop are included.***

- A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
- An alignable boolean Series. The index of the key will be aligned before masking.
- An alignable Index. The Index of the returned selection will be the input.
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
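The allowed inputs listed above can be sketched on a frame with string row labels. The frame people and its labels are assumptions chosen so that labels are clearly distinct from integer positions.

```python
import pandas as pd

# Stand-in frame indexed by string labels (illustration only).
people = pd.DataFrame(
    {"Age": [32, 45, 30, 35], "Salary": [15000, 35000, 25000, 36000]},
    index=["laura", "marta", "micheal", "alex"],
)

single = people.loc["marta"]                      # single label -> Series
several = people.loc[["alex", "laura"]]           # list of labels, in the given order
label_slice = people.loc["laura":"micheal"]       # label slice, stop label included
masked = people.loc[[True, False, True, False]]   # boolean list, one entry per row
by_series = people.loc[people["Age"] > 31]        # boolean Series aligned on the index
by_callable = people.loc[lambda d: d["Salary"] < 30000]  # callable returning a mask
```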

In [12]:
df.loc[:,"Name"]

# df.loc[all rows, specific column]

0      Laura
1      Marta
2    Micheal
3       Alex
Name: Name, dtype: object

In [13]:
df.loc[:2,"Name"]

# df.loc[start_label:end_label, specific column]
# Remember: with loc, the end label is inclusive (unlike iloc and plain Python slices).

0      Laura
1      Marta
2    Micheal
Name: Name, dtype: object

In [14]:
df.loc[:2, ["Name","Gender"]]

|   | Name | Gender |
|---|---|---|
| 0 | Laura | Female |
| 1 | Marta | Female |
| 2 | Micheal | Male |


In [15]:
df.loc[:2,"Name":"Gender"]

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female |
| 1 | Marta | 45 | 35000 | Female |
| 2 | Micheal | 30 | 25000 | Male |


In [16]:
# reverse
df.loc[::-1,:]
# df.loc[all rows in reverse order, all columns]

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 3 | Alex | 35 | 36000 | Male |
| 2 | Micheal | 30 | 25000 | Male |
| 1 | Marta | 45 | 35000 | Female |
| 0 | Laura | 32 | 15000 | Female |


In [17]:
df.loc[:,:"Salary"]
# df.loc[all rows, all columns up to and including Salary]

|   | Name | Age | Salary |
|---|---|---|---|
| 0 | Laura | 32 | 15000 |
| 1 | Marta | 45 | 35000 |
| 2 | Micheal | 30 | 25000 |
| 3 | Alex | 35 | 36000 |


### pandas.DataFrame.iloc

Purely integer-location based indexing for selection by position.

.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.

.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
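The out-of-bounds rule can be checked with a short sketch. The frame people is a stand-in with four rows, introduced only for this example.

```python
import pandas as pd

# Stand-in frame with four rows (positions 0 to 3).
people = pd.DataFrame({
    "Name": ["Laura", "Marta", "Micheal", "Alex"],
    "Age": [32, 45, 30, 35],
})

# A slice that runs past the end is silently clipped to the available rows.
clipped = people.iloc[2:10]

# A single out-of-bounds position raises IndexError.
try:
    people.iloc[10]
    raised = False
except IndexError:
    raised = True
```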

In [18]:
df.iloc[:,0]
# df.iloc[all rows, column at position 0]

0      Laura
1      Marta
2    Micheal
3       Alex
Name: Name, dtype: object

In [19]:
df.iloc[[0,1]]

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female |
| 1 | Marta | 45 | 35000 | Female |


In [20]:
df.iloc[:3]

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female |
| 1 | Marta | 45 | 35000 | Female |
| 2 | Micheal | 30 | 25000 | Male |


In [21]:
df.iloc[:,0:2]

|   | Name | Age |
|---|---|---|
| 0 | Laura | 32 |
| 1 | Marta | 45 |
| 2 | Micheal | 30 |
| 3 | Alex | 35 |


## Filtering Pandas Data Frame

In [22]:
df.Salary > 20000

0    False
1     True
2     True
3     True
Name: Salary, dtype: bool

In [23]:
type(df.Salary > 20000)

pandas.core.series.Series

In [24]:
filter1 = df[df.Salary > 20000]
filter1

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 1 | Marta | 45 | 35000 | Female |
| 2 | Micheal | 30 | 25000 | Male |
| 3 | Alex | 35 | 36000 | Male |


In [25]:
type(filter1)

pandas.core.frame.DataFrame

In [26]:
filter2 = df[(df.Salary > 20000) & (df.Age <= 30)]  # parentheses are required: & binds tighter than > and <=
filter2

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 2 | Micheal | 30 | 25000 | Male |


In [27]:
df[df.Age > 40]

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 1 | Marta | 45 | 35000 | Female |
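Besides &, boolean masks can be combined with | (or) and negated with ~ (not), and isin() tests membership in a list. The sketch below uses a stand-in frame named people (an assumption for illustration). Each comparison stays in parentheses because &, | and ~ bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Stand-in frame with the same kind of data as the notebook's df.
people = pd.DataFrame({
    "Name": ["Laura", "Marta", "Micheal", "Alex"],
    "Age": [32, 45, 30, 35],
    "Salary": [15000, 35000, 25000, 36000],
    "Gender": ["Female", "Female", "Male", "Male"],
})

# OR: salary above 30000, or younger than 31.
either = people[(people.Salary > 30000) | (people.Age < 31)]

# NOT: every row whose Gender is not "Female".
not_female = people[~(people.Gender == "Female")]

# Membership test against a list of names.
selected = people[people.Name.isin(["Laura", "Alex"])]
```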


## List Comprehension

### pandas.DataFrame.mean

DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Return the mean of the values over the requested axis.

In [28]:
average_salary = df.Salary.mean()
average_salary

27750.0

### numpy.mean

numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)

Compute the arithmetic mean along the specified axis.

Returns the average of the array elements. The average is taken over the flattened array by default, otherwise over the specified axis. float64 intermediate and return values are used for integer inputs.

In [29]:
import numpy as np

In [30]:
np.mean(df.Salary)

27750.0

In [31]:
df["SalaryLevel"] = ["high" if average_salary < i else "low" for i in df.Salary]
df

|   | Name | Age | Salary | Gender | SalaryLevel |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |
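The same high/low labelling can also be written without a Python-level loop. The following is a sketch of one alternative using numpy.where, on a stand-in frame named salaries (an assumption for illustration); it is not the approach used in the notebook's cells.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the notebook's salary values.
salaries = pd.DataFrame({"Salary": [15000, 35000, 25000, 36000]})
average = salaries["Salary"].mean()

# np.where evaluates the condition for the whole column at once:
# "high" where the salary is above the average, "low" elsewhere.
salaries["SalaryLevel"] = np.where(salaries["Salary"] > average, "high", "low")
```

On large frames the vectorized form usually runs faster than a list comprehension, though for four rows the difference is negligible.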


In [32]:
df.columns = [i.lower() for i in df.columns]
df

|   | name | age | salary | gender | salarylevel |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |


In [33]:
df.columns = [i.upper() for i in df.columns]
df

|   | NAME | AGE | SALARY | GENDER | SALARYLEVEL |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |


In [34]:
df.columns = [i.capitalize() for i in df.columns]
df

|   | Name | Age | Salary | Gender | Salarylevel |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |
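Column labels can also be transformed without a list comprehension. The sketch below shows the .str accessor on the column Index and DataFrame.rename; the frame named frame and the new label salary_level are assumptions for illustration.

```python
import pandas as pd

# Stand-in frame with two columns.
frame = pd.DataFrame({"Name": ["Laura"], "SalaryLevel": ["low"]})

# The column Index has string methods available through .str.
lower = frame.columns.str.lower()

# rename() accepts a mapping to change selected labels only ...
renamed = frame.rename(columns={"SalaryLevel": "salary_level"})

# ... or a function applied to every label.
upper_frame = frame.rename(columns=str.upper)
```

rename() returns a new frame and leaves the original column labels unchanged.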


## Concatenating Data

### pandas.DataFrame.drop

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly. When using a multi-index, labels on different levels can be removed by specifying the level. See the advanced indexing section of the pandas user guide for more information about the now unused levels.
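The different ways of naming what to drop can be sketched as follows. The frame people and the non-existent column name Bonus are assumptions for illustration.

```python
import pandas as pd

# Stand-in frame for illustration.
people = pd.DataFrame({
    "Name": ["Laura", "Marta", "Micheal", "Alex"],
    "Age": [32, 45, 30, 35],
    "Salary": [15000, 35000, 25000, 36000],
})

# Drop a column by name.
without_age = people.drop(columns="Age")

# Drop rows by index label.
without_rows = people.drop(index=[0, 3])

# The same row drop written with labels and axis.
same_via_axis = people.drop([0, 3], axis=0)

# errors="ignore" suppresses the KeyError for a label that does not exist.
unchanged = people.drop(columns="Bonus", errors="ignore")
```

Each call returns a new frame; people itself is modified only when inplace=True is passed.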

In [35]:
df["SalaryLevel"] = ["high" if average_salary < i else "low" for i in df.Salary]
df

|   | Name | Age | Salary | Gender | Salarylevel | SalaryLevel |
|---|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low | low |
| 1 | Marta | 45 | 35000 | Female | high | high |
| 2 | Micheal | 30 | 25000 | Male | low | low |
| 3 | Alex | 35 | 36000 | Male | high | high |


In [39]:
df = df.drop(columns="SalaryLevel")
df

|   | Name | Age | Salary | Gender | Salarylevel |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |


In [40]:
df["SalaryLevel"] = ["high" if average_salary < i else "low" for i in df.Salary]
df

|   | Name | Age | Salary | Gender | Salarylevel | SalaryLevel |
|---|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low | low |
| 1 | Marta | 45 | 35000 | Female | high | high |
| 2 | Micheal | 30 | 25000 | Male | low | low |
| 3 | Alex | 35 | 36000 | Male | high | high |


In [41]:
df.drop(["SalaryLevel"], axis = 1, inplace= True)
# axis: { 0 or "index", 1 or "columns"}, default = 0
df

|   | Name | Age | Salary | Gender | Salarylevel |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |


### pandas.concat

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Concatenate pandas objects along a particular axis with optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
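When both inputs use the default 0..n-1 index, the result of concat repeats those labels. The sketch below shows two ways of dealing with that: ignore_index=True renumbers the rows, and keys adds the hierarchical level mentioned above. The frames first and second and the key names "old" and "new" are assumptions for illustration.

```python
import pandas as pd

# Two small stand-in frames, each with the default index 0, 1.
first = pd.DataFrame({"Name": ["Laura", "Marta"], "Age": [32, 45]})
second = pd.DataFrame({"Name": ["Liam", "Noah"], "Age": [25, 20]})

# Default: the original index labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([first, second], ignore_index=True)

# keys adds an outer index level that records which input a row came from.
keyed = pd.concat([first, second], keys=["old", "new"])
liam = keyed.loc[("new", 0), "Name"]
```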

In [42]:
dictionary2 = {
    "Name": ["Liam","Noah","Oliver","Ava"],
    "Age":[25,20,27,26],
    "Salary":[12000,7500,16000,14000],
    "Gender":["Male","Male","Male","Female"]
}

In [45]:
df2 = pd.DataFrame(dictionary2)
df2

|   | Name | Age | Salary | Gender |
|---|---|---|---|---|
| 0 | Liam | 25 | 12000 | Male |
| 1 | Noah | 20 | 7500 | Male |
| 2 | Oliver | 27 | 16000 | Male |
| 3 | Ava | 26 | 14000 | Female |


In [49]:
df_new = pd.concat([df,df2])
df_new

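|   | Name | Age | Salary | Gender | Salarylevel |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |
| 0 | Liam | 25 | 12000 | Male | NaN |
| 1 | Noah | 20 | 7500 | Male | NaN |
| 2 | Oliver | 27 | 16000 | Male | NaN |
| 3 | Ava | 26 | 14000 | Female | NaN |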


In [50]:
df_new["Salarylevel"] = ["high" if average_salary < i else "low" for i in df_new.Salary]
df_new

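|   | Name | Age | Salary | Gender | Salarylevel |
|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low |
| 1 | Marta | 45 | 35000 | Female | high |
| 2 | Micheal | 30 | 25000 | Male | low |
| 3 | Alex | 35 | 36000 | Male | high |
| 0 | Liam | 25 | 12000 | Male | low |
| 1 | Noah | 20 | 7500 | Male | low |
| 2 | Oliver | 27 | 16000 | Male | low |
| 3 | Ava | 26 | 14000 | Female | low |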


## Transforming Data

In [61]:
# create a list of our conditions
# (the ranges are contiguous, so ages such as 3, 40 or 60 do not fall into a gap)
conditions = [
    (0 <= df_new["Age"]) & (df_new["Age"] <= 2),
    (2 < df_new["Age"]) & (df_new["Age"] <= 39),
    (39 < df_new["Age"]) & (df_new["Age"] <= 59),
    (59 < df_new["Age"]) & (df_new["Age"] <= 99)
]

# create a list of the values we want to assign for each condition
values = ["Baby","Young Adults","Middle-aged Adults","Old Adults"]

In [62]:
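# default is used for rows that match no condition; a string default keeps the result dtype consistent with values
df_new["AgeGroup"] = np.select(conditions, values, default="Unknown")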

In [63]:
df_new

|   | Name | Age | Salary | Gender | Salarylevel | AgeGroup |
|---|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low | Young Adults |
| 1 | Marta | 45 | 35000 | Female | high | Middle-aged Adults |
| 2 | Micheal | 30 | 25000 | Male | low | Young Adults |
| 3 | Alex | 35 | 36000 | Male | high | Young Adults |
| 0 | Liam | 25 | 12000 | Male | low | Young Adults |
| 1 | Noah | 20 | 7500 | Male | low | Young Adults |
| 2 | Oliver | 27 | 16000 | Male | low | Young Adults |
| 3 | Ava | 26 | 14000 | Female | low | Young Adults |
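For binning a numeric column into labelled ranges, pandas.cut is a possible alternative to numpy.select. The sketch below is an illustration under assumed bin edges: the intervals (0, 2], (2, 39], (39, 59] and (59, 99] are contiguous and right-inclusive, and the Series named ages simply repeats the ages shown in the table.

```python
import pandas as pd

# The ages from the table above, as a stand-alone Series.
ages = pd.Series([32, 45, 30, 35, 25, 20, 27, 26])

# Each value falls into exactly one right-inclusive interval:
# (0, 2], (2, 39], (39, 59], (59, 99].
age_group = pd.cut(
    ages,
    bins=[0, 2, 39, 59, 99],
    labels=["Baby", "Young Adults", "Middle-aged Adults", "Old Adults"],
)

# Convert the categorical result to plain strings for inspection.
group_names = age_group.astype(str).tolist()
```

Values outside all bins become NaN, so an age of 0 or above 99 would need its own bin or separate handling.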


In [64]:
def increaseSalary(salary):
    return salary * 2

In [65]:
df_new["NewSalary"]= df_new.Salary.apply(increaseSalary)
df_new

|   | Name | Age | Salary | Gender | Salarylevel | AgeGroup | NewSalary |
|---|---|---|---|---|---|---|---|
| 0 | Laura | 32 | 15000 | Female | low | Young Adults | 30000 |
| 1 | Marta | 45 | 35000 | Female | high | Middle-aged Adults | 70000 |
| 2 | Micheal | 30 | 25000 | Male | low | Young Adults | 50000 |
| 3 | Alex | 35 | 36000 | Male | high | Young Adults | 72000 |
| 0 | Liam | 25 | 12000 | Male | low | Young Adults | 24000 |
| 1 | Noah | 20 | 7500 | Male | low | Young Adults | 15000 |
| 2 | Oliver | 27 | 16000 | Male | low | Young Adults | 32000 |
| 3 | Ava | 26 | 14000 | Female | low | Young Adults | 28000 |
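For simple arithmetic like doubling, apply() is not strictly needed, because arithmetic operators work on a whole Series at once. The sketch below compares both forms on a stand-in Series named salaries; the function name increase_salary is a snake_case stand-in for the notebook's increaseSalary.

```python
import pandas as pd

# Stand-in Series with four of the salary values.
salaries = pd.Series([15000, 35000, 25000, 36000], name="Salary")

def increase_salary(salary):
    """Return the doubled salary."""
    return salary * 2

# apply() calls the function once per element.
via_apply = salaries.apply(increase_salary)

# The vectorized form multiplies the whole Series in one operation.
vectorized = salaries * 2
```

apply() remains useful when the per-element logic cannot be expressed with vectorized operations.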


In [66]:
df_new["Salary"] = df_new["NewSalary"]
df_new

|   | Name | Age | Salary | Gender | Salarylevel | AgeGroup | NewSalary |
|---|---|---|---|---|---|---|---|
| 0 | Laura | 32 | 30000 | Female | low | Young Adults | 30000 |
| 1 | Marta | 45 | 70000 | Female | high | Middle-aged Adults | 70000 |
| 2 | Micheal | 30 | 50000 | Male | low | Young Adults | 50000 |
| 3 | Alex | 35 | 72000 | Male | high | Young Adults | 72000 |
| 0 | Liam | 25 | 24000 | Male | low | Young Adults | 24000 |
| 1 | Noah | 20 | 15000 | Male | low | Young Adults | 15000 |
| 2 | Oliver | 27 | 32000 | Male | low | Young Adults | 32000 |
| 3 | Ava | 26 | 28000 | Female | low | Young Adults | 28000 |


In [68]:
df_new.drop(columns="NewSalary",inplace=True)
df_new

|   | Name | Age | Salary | Gender | Salarylevel | AgeGroup |
|---|---|---|---|---|---|---|
| 0 | Laura | 32 | 30000 | Female | low | Young Adults |
| 1 | Marta | 45 | 70000 | Female | high | Middle-aged Adults |
| 2 | Micheal | 30 | 50000 | Male | low | Young Adults |
| 3 | Alex | 35 | 72000 | Male | high | Young Adults |
| 0 | Liam | 25 | 24000 | Male | low | Young Adults |
| 1 | Noah | 20 | 15000 | Male | low | Young Adults |
| 2 | Oliver | 27 | 32000 | Male | low | Young Adults |
| 3 | Ava | 26 | 28000 | Female | low | Young Adults |


In [70]:
average_salary = df_new.Salary.mean()

df_new["Salarylevel"] = ["high" if average_salary < i else "low" for i in df_new.Salary]
df_new

|   | Name | Age | Salary | Gender | Salarylevel | AgeGroup |
|---|---|---|---|---|---|---|
| 0 | Laura | 32 | 30000 | Female | low | Young Adults |
| 1 | Marta | 45 | 70000 | Female | high | Middle-aged Adults |
| 2 | Micheal | 30 | 50000 | Male | high | Young Adults |
| 3 | Alex | 35 | 72000 | Male | high | Young Adults |
| 0 | Liam | 25 | 24000 | Male | low | Young Adults |
| 1 | Noah | 20 | 15000 | Male | low | Young Adults |
| 2 | Oliver | 27 | 32000 | Male | low | Young Adults |
| 3 | Ava | 26 | 28000 | Female | low | Young Adults |


# Resources

- [Python: Yapay Zeka için Python Programlama (1)](https://www.udemy.com/course/python-sfrdan-uzmanlga-programlama-1/?src=sac&kw=python+i%C3%A7in+yapay+zeka) (Udemy course; the title translates to "Python Programming for Artificial Intelligence (1)")
- [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html)
- [NumPy Documentation](https://numpy.org/doc/stable/reference/index.html)