
<a href="https://colab.research.google.com/github/aleylani/Databehandling/blob/main/lectures/L1_pandas_basics.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code

---
# Lecture notes - Pandas basics

---
This is the lecture note for **Pandas basics**. It builds on content from the previous course:
- Python programming

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to pandas. I encourage you to read further about pandas.</p>

Read more 

- [documentation - Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)

- [documentation - pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)

- [documentation - DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame)

- [documentation - read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

- [documentation - indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

- [documentation - masking](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html)

- [documentation - read_excel](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)

- [documentation - seaborn barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html)
---

## Pandas Series

A 1D array with flexible indices. A Series can be seen as a "typed dictionary"; the typing makes it more efficient than a plain dictionary in certain computations.
- create from dictionary 
- create from list 
- create from array 
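
The three ways of creating a Series listed above can be sketched as follows. This is a minimal illustration that reuses two of the program names from this note; the variable names are only for the example.

```python
import numpy as np
import pandas as pd

# from a dictionary: the keys become the index labels
from_dict = pd.Series({"AI": 25, "NET": 30})

# from a list: a default integer index 0..n-1 is created
from_list = pd.Series([25, 30])

# from a NumPy array, here with explicit index labels
from_array = pd.Series(np.array([25, 30]), index=["AI", "NET"])

print(from_dict.index.tolist())  # ['AI', 'NET']
print(from_list.index.tolist())  # [0, 1]
print(from_array)
```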

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = dict(AI = 25, NET = 30 , APP = 30, Java = 27) # number of students

series_programs = pd.Series(data=data)
print(series_programs)

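# extract values by position (plain [] with an integer on a label index is deprecated in newer pandas)
print(f"series_programs.iloc[0] -> {series_programs.iloc[0]}")
print(f"series_programs.iloc[-1] -> {series_programs.iloc[-1]}")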

# get the keys
print(f"series_programs.keys() -> {series_programs.keys()}") 
print(f"series_programs.keys()[2] -> {series_programs.keys()[2]}") 

In [None]:
import random as rnd
rnd.seed(42)

# create Series using list
dice_series = pd.Series([rnd.randint(1,6) for _ in range(5)])
print(dice_series)

# some useful methods
print(f"Min value {dice_series.min()}")
print(f"Mean value {dice_series.mean()}")
print(f"Median value {dice_series.median()}")

---
## DataFrame
The analog of a 2D NumPy array with flexible row indices and column names. It can also be seen as a specialized dictionary where each column name is mapped to a Series object.

- notice that most DataFrame methods return a new object instead of changing the original, which means that you have to assign the result to a variable for the changes to persist
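
A small sketch of what this means in practice, using `rename` as the example method. The frame below is a cut-down version of the programs data.

```python
import pandas as pd

df = pd.DataFrame({"Students": [25, 30]}, index=["AI", "NET"])

# rename returns a new DataFrame; df itself is unchanged
df.rename(columns={"Students": "Num students"})
print(df.columns.tolist())  # ['Students']

# assign the return value to keep the change
renamed = df.rename(columns={"Students": "Num students"})
print(renamed.columns.tolist())  # ['Num students']
```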

In [None]:
df_programs = pd.DataFrame(series_programs,columns=("Num students",))
df_programs

In [None]:
# create 2 Series objects using dictionary
students = pd.Series(dict(AI = 25, NET = 30 , APP = 30, Java = 27))
language = pd.Series(dict(AI="Python", NET="C#", APP="Kotlin", Java = "Java"))

# create a DataFrame from 2 Series objects using dictionary
df_programs = pd.DataFrame({"Students":students, "Language":language}) # key becomes col name
df_programs

from array

In [None]:
df_programs = pd.DataFrame({
    "Students": np.array((25, 30, 30, 27)),
    "Language": np.array(("Python", "C#", "Kotlin", "Java"))})
df_programs

from list

In [None]:
df_programs = pd.DataFrame({
    "Students": [25, 30, 30, 27],
    "Language": ["Python", "C#", "Kotlin", "Java"]},
    index = ["AI", ".NET", "APP", "Java"])
df_programs

In [None]:
# arrays can also be combined with an explicit index
df_programs = pd.DataFrame({
    "Students": np.array((25, 30, 30, 27)),
    "Language": np.array(("Python", "C#", "Kotlin", "Java"))},
    index = ["AI", ".NET", "APP", "Java"])
df_programs

In [None]:
df_programs.index # dtype object is used for text or mixed numeric or non-numeric values

---
## Data selection
- dictionary-style indexing
- attribute-style indexing
    - can give unexpected errors as some methods can share same name as col name   
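
To illustrate the pitfall with attribute-style indexing, here is a sketch with a hypothetical column called `size`, which is also the name of a DataFrame attribute. The column name and values are made up for the example.

```python
import pandas as pd

# hypothetical frame with a column named "size"
df = pd.DataFrame({"size": [10, 20, 30],
                   "Language": ["Python", "C#", "Kotlin"]})

attr_result = df.size    # the DataFrame attribute: number of elements (3 rows * 2 columns)
col_result = df["size"]  # the column, as a Series

print(attr_result)          # 6
print(col_result.tolist())  # [10, 20, 30]
```

Dictionary-style indexing always returns the column, so it is the safer default.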

In [None]:
# gives a Series object of Students 
df_programs["Students"] # dictionary indexing

In [None]:
# select multiple columns using list 
df_programs[["Language", "Students"]]

In [None]:
df_programs.Language # attribute indexing

In [None]:
df_programs["Language"][".NET"] # selects the Language Series and indexes .NET

---
## Indexers
Indexers give a slicing interface for the indices. `loc` and `iloc` are attributes of Series and DataFrame objects.

| Indexer | Description                                                      |
| :-----: | ---------------------------------------------------------------- |
|   loc   | slicing and indexing by explicit index label                     |
|  iloc   | slicing and indexing by implicit integer position (Python-style) |
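
The difference between the two indexers is easiest to see when the explicit labels are integers that do not match the positions. A minimal sketch with a made-up Series:

```python
import pandas as pd

# hypothetical Series whose labels are integers that differ from the positions
s = pd.Series(["a", "b", "c"], index=[1, 2, 3])

print(s.loc[1])   # label 1    -> a
print(s.iloc[1])  # position 1 -> b

# loc slices include the end label, iloc slices exclude the end position
print(s.loc[1:2].tolist())   # ['a', 'b']
print(s.iloc[1:2].tolist())  # ['b']
```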

In [None]:
print(df_programs.loc["Java"])

# index multiple rows
df_programs.loc[["Java", "APP"]]

In [None]:
# slicing with array-style indices
df_programs.iloc[1:3]

---
## Masking
Boolean masking selects the rows where the condition is True

```py
df = df[conditions]
```

In [None]:
print(df_programs["Students"] > 25) # this gives a pandas Series of type bool 

df_over_25 = df_programs[df_programs["Students"]>25]
df_over_25
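
Several conditions can be combined into one mask. The sketch below rebuilds the programs DataFrame from earlier in this note so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {"Students": [25, 30, 30, 27],
     "Language": ["Python", "C#", "Kotlin", "Java"]},
    index=["AI", ".NET", "APP", "Java"])

# combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (df["Students"] > 25) & (df["Language"] != "Java")
print(df[mask].index.tolist())   # ['.NET', 'APP']

# ~ negates a mask
print(df[~mask].index.tolist())  # ['AI', 'Java']
```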

---
## Read excel data
- reads an .xlsx-file and stores it as DataFrame object

Data comes from: [Kaggle calorie data](https://www.kaggle.com/kkhandekar/calories-in-food-items-per-100-grams)

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns # used for plotting 

df = pd.read_excel("../Data/calories.xlsx")
df.head() # see the first n rows of DataFrame, n = 5 by default 

In [None]:
df.info() # info about df (dtypes, non-null values and memory usage)

In [None]:
df["FoodCategory"].unique() # see liquids and solid foods

In [None]:
df["per100grams"].unique()

---
## Data cleaning 

- notice that all columns have dtype object, so the calorie values need to be converted to int

Strategy
- change column names
- convert Cals_per100grams to int to make calculations with it
- separate into liquids and solids DataFrames
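
The strategy above can be sketched on a tiny made-up frame. The rows and the `"<number> cal"` text format are assumptions for illustration, not values taken from the real file, and `str.extract` is one possible way to pull out the digits.

```python
import pandas as pd

# made-up rows; the "<number> cal" format is an assumption about the raw column
raw = pd.DataFrame({"Cals_per100grams": ["62 cal", "120 cal"],
                    "per100grams": ["100g", "100ml"]})

# 1. change column names
clean = raw.rename(columns={"Cals_per100grams": "Calories",
                            "per100grams": "per100"})

# 2. keep only the leading digits, then convert to int
clean["Calories"] = clean["Calories"].str.extract(r"(\d+)", expand=False).astype(int)

# 3. separate into solids and liquids with boolean masks
solids = clean[clean["per100"] == "100g"]
liquids = clean[clean["per100"] == "100ml"]

print(clean["Calories"].tolist())  # [62, 120]
```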

In [None]:
df = df.rename(dict(Cals_per100grams="Calories",
               per100grams="per100", KJ_per100grams="kJ"), axis="columns")
df.head()


In [None]:
# convert Calories to int: drop the last three characters (the trailing unit text) before casting
df["Calories"] = df["Calories"].str[:-3].astype(int)
df.head()

In [None]:
# check number of values in solids and liquids
df["per100"].value_counts()

In [None]:
liquids = df[df["per100"] == "100ml"]
liquids.head(2)

In [None]:
solids = df[df["per100"] == "100g"]
solids.head(2)

---
## Find the top 5 categories with the highest calories

In [None]:
solids_sorted = solids.sort_values(by="Calories", ascending=False) # sorting descending by Calories column
solids_top5 = solids_sorted.iloc[:5] # Python-way slicing
solids_top5

In [None]:
liquids_top5 = liquids.sort_values(by="Calories", ascending=False).head()
liquids_top5

In [None]:
# top five food categories 
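# numeric_only=True skips the text columns, which would otherwise raise an error in newer pandas
top5_median = df.groupby("FoodCategory").median(numeric_only=True).sort_values(by="Calories", ascending=False).head().reset_index()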
top5_median
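
A minimal sketch of the group-then-aggregate step on a made-up frame. The category names and numbers are invented only to show the mechanics. Selecting the `Calories` column before calling `median()` keeps text columns out of the aggregation.

```python
import pandas as pd

# made-up values, only to show the mechanics
df_toy = pd.DataFrame({
    "FoodCategory": ["Oils", "Oils", "Fruits", "Fruits", "Fruits"],
    "FoodItem": ["A", "B", "C", "D", "E"],
    "Calories": [900, 800, 50, 60, 40]})

medians = (df_toy.groupby("FoodCategory")["Calories"]
                 .median()                      # one median per category
                 .sort_values(ascending=False)  # highest median first
                 .reset_index())                # turn the category index back into a column
print(medians)
```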

In [None]:
# visualization using seaborn
fig, axes = plt.subplots(1,3, dpi=120, figsize=(16,4))

titles = ["Solid top 5", "Liquids top 5", "Top 5 per group median"]
data_frames = [solids_top5, liquids_top5, top5_median]
x_column = ["FoodItem", "FoodItem", "FoodCategory"]

# one bar plot per subplot; zip pairs each axis with its data, title and x column
for ax, data, title, x in zip(axes, data_frames, titles, x_column):
    sns.barplot(data=data, x=x, y="Calories", ax=ax)
    ax.set(title=title)
    ax.tick_params(axis="x", rotation=90) # rotate the labels on the x axis
plt.savefig("../Visualisations/Calories.png", facecolor="white", bbox_inches="tight")