# Exploratory Data Analysis with Python
<div style="
    border: 5px solid purple;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

Please access your Jupyter notebook through Moodle. Make sure you are using the right kernel, where the DataX team has already installed the necessary packages and libraries.

<code>pandas</code> is the main library for data analysis in Python, and it is also fundamental for data visualization.

## Importing libraries
<div style="
    border: 4px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

Normally you import all the necessary libraries at the beginning of the notebook. When loading the pandas library, the convention is to give it the abbreviation <code>pd</code>.

In [None]:
import pandas as pd
import openpyxl #engine that pandas uses to read and write Excel (.xlsx) files

## The basics - Understanding a dataframe
<div style="
    border: 4px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

<div style="
    border: 3px solid orange;
    border-radius: 8px;
    padding: 12px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
A dataframe is a "size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure."
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
</div>
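
To make the definition above more concrete, here is a minimal sketch of two of its points: arithmetic aligns on labels, and a dataframe behaves like a dict-like container of Series. The dataframes `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Two small dataframes with partly different row labels (hypothetical data)
left = pd.DataFrame({"score": [10, 20, 30]}, index=["a", "b", "c"])
right = pd.DataFrame({"score": [1, 2, 3]}, index=["b", "c", "d"])

# Addition matches rows by label, not by position;
# labels that exist in only one dataframe produce NaN
total = left + right

# Dict-like access: selecting one column returns a Series
score_column = left["score"]

print(total)
print(type(score_column))
```

Rows `a` and `d` come out as missing values because each of them exists in only one of the two dataframes.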

### Building a dataframe from a dictionary
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
mydict = {
    "names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"],
    "scores": [39, 34, 40, 49, 10],
    "fav_food": ["tacos", "pasta", "cake", "döner", "ice cream"]
}

In [None]:
#building a dataframe from the dictionary with pandas
df = pd.DataFrame(mydict)

<code>pd.DataFrame</code> converts objects such as dictionaries, lists, or arrays into a dataframe.
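
A dictionary of lists is only one possible input. As a small sketch, the same constructor also accepts a list of dictionaries (one per row) or a list of lists with separately supplied column names. The variable names below are illustrative.

```python
import pandas as pd

# A list of dictionaries: each dictionary becomes one row
records = [
    {"names": "Gustavo", "scores": 39},
    {"names": "Henrik", "scores": 34},
]
df_from_records = pd.DataFrame(records)

# A list of lists: each inner list is one row,
# and the column names are passed separately
rows = [["Wanja", 40], ["Carlo", 49]]
df_from_rows = pd.DataFrame(rows, columns=["names", "scores"])

print(df_from_records)
print(df_from_rows)
```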

## Loading files
<div style="
    border: 4px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

Throughout the course we will be working with data stored in Google Sheets. You can download the files manually, but I recommend importing the data directly, as shown below.

The URLs share a similar structure; the only part that changes between different Google spreadsheets is the <code>url_id</code>.

In [None]:
base_url = "https://docs.google.com/spreadsheets/d/"
url_id = "1DI-4F8_UfqdzqEsUBJ-QCIBe26Mx0K-1tbIFVGKvBsQ/"
export = "export?format=xlsx" #query string that asks Google Sheets for an Excel file

whole_url = base_url + url_id + export
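
Since only the spreadsheet ID changes, the URL construction can be wrapped in a small helper. This is a sketch under assumptions: `build_export_url` is a hypothetical name, `"YOUR_SHEET_ID"` is a placeholder, and it assumes the spreadsheet is shared publicly and that the export endpoint accepts `format=xlsx` or `format=csv`.

```python
def build_export_url(sheet_id: str, file_format: str = "xlsx") -> str:
    """Return a download URL for a shared Google spreadsheet (hypothetical helper)."""
    base_url = "https://docs.google.com/spreadsheets/d/"
    # strip("/") lets the ID be passed with or without a trailing slash
    return f"{base_url}{sheet_id.strip('/')}/export?format={file_format}"


# "YOUR_SHEET_ID" is a placeholder, not a real spreadsheet
example_url = build_export_url("YOUR_SHEET_ID")
print(example_url)
```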

The functions <code>read_excel</code> and <code>read_csv</code> are normally used to import data from local files and from URLs.

In [None]:
df = pd.read_excel(whole_url)
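
<code>read_csv</code> works the same way and also accepts file-like objects. The sketch below reads a tiny CSV from an in-memory string so that it runs without a file or an internet connection; `csv_text` is made-up data.

```python
import io

import pandas as pd

# A small CSV held in memory instead of a file on disk (hypothetical data)
csv_text = "names,scores\nGustavo,39\nHenrik,34\n"

# io.StringIO wraps the string so that read_csv can treat it like a file
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```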

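You can print the dataframe with <code>print()</code>, but the plain-text output is harder to read than the table the notebook renders when the dataframe is the last expression in a cell.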

### Inspecting the structure of a dataframe
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#how many rows and columns: shape returns a tuple (rows, columns)
df.shape

In [None]:
#retrieving them separately and printing the number of rows and columns
rownums = df.shape[0]
colnums = df.shape[1]
print(f"the number of rows is {rownums} and the number of columns is {colnums}")

In [None]:
#another way of getting the number of rows, using len()
len(df)
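
Because <code>shape</code> is a plain tuple, both numbers can also be unpacked in one step. A minimal sketch on a made-up dataframe called `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "names": ["Gustavo", "Henrik", "Wanja"],
    "scores": [39, 34, 40],
})

# shape is (rows, columns), so it can be unpacked into two variables
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")

# len() on a dataframe counts rows; len() on .columns counts columns
same_rows = len(toy) == n_rows
same_cols = len(toy.columns) == n_cols
```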

In [None]:
#a more detailed overview: column names, non-null counts and data types
df.info()

In [None]:
pd.set_option('display.max_columns', None) # to show all columns
#pd.reset_option('display.max_columns')
#checking the first 5 rows
df.head(5)

In [None]:
#checking the last 5 rows
df.tail(5)

In [None]:
#checking a random sample of 5 rows
df.sample(5)
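
<code>sample</code> returns different rows on every run. If you need the same rows each time, for example to discuss them with a classmate, you can fix the random seed with the `random_state` parameter. A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"scores": range(100)})

# The same random_state always selects the same rows
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)

print(first_draw.equals(second_draw))
```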

In [None]:
#checking the column names
df.columns

In [None]:
#accessing a specific column name by position (position 5 is the sixth column)
df.columns[5]

In [None]:
#getting some descriptive statistics for numeric columns
df.describe()

In [None]:
#getting some descriptive statistics for columns with the object data type (typically text)
df.describe(include="object")
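
The result of <code>describe()</code> is itself a dataframe, so individual statistics can be looked up by their row and column labels. A minimal sketch on a made-up dataframe `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"],
    "scores": [39, 34, 40, 49, 10],
})

# Without arguments, describe() summarises only the numeric columns
summary = toy.describe()

# Single statistics are selected with .loc[row label, column label]
mean_score = summary.loc["mean", "scores"]
max_score = summary.loc["max", "scores"]

print(summary)
print(mean_score, max_score)
```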

### Quality of the dataframe
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#counting fully duplicated rows in the dataframe
n_duplicates = int(df.duplicated().sum())
print(f"this dataset has {n_duplicates} duplicates")

In [None]:
#how many missing values per column
df.isna().sum()
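
Counting duplicates and missing values tells you how many there are, but not which rows are affected. The sketch below shows one way to look at them on a made-up dataframe; the column `height` and all values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "base shape": ["square", "round", "square", None, "round"],
    "height": [1.0, 2.0, 1.0, 3.0, 2.5],
})

# duplicated() flags the second and later copies of a row
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every copy, so the full group of duplicates can be inspected
duplicate_rows = toy[toy.duplicated(keep=False)]

# Missing values per column, and the rows that contain at least one
missing_per_column = toy.isna().sum()
rows_with_missing = toy[toy.isna().any(axis=1)]

print(duplicate_rows)
print(rows_with_missing)
```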

In [None]:
#checking the datatypes
df.dtypes

In [None]:
#checking unique values in a column, for example in column "base shape"
df["base shape"].unique()

In [None]:
#checking how many unique values are in column "base shape"
df["base shape"].nunique()

In [None]:
#getting a frequency table of the column "base shape"
df["base shape"].value_counts()
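
These three methods treat missing values differently, which is worth knowing when the numbers do not seem to add up. A small sketch on a made-up Series:

```python
import pandas as pd

shapes = pd.Series(
    ["square", "round", "square", None, "round", "square"],
    name="base shape",
)

unique_values = shapes.unique()    # includes the missing value
n_unique = shapes.nunique()        # ignores missing values by default
counts = shapes.value_counts()     # ignores missing values by default

# dropna=False adds a row for the missing values
counts_with_missing = shapes.value_counts(dropna=False)

# normalize=True returns shares instead of counts
shares = shapes.value_counts(normalize=True)

print(counts_with_missing)
print(shares)
```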

## Dataframe Operations
<div style="
    border: 4px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

### Modifying the index
<div style="
    border: 2px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#making a copy of your dataset. Recommended especially if you are modifying the original df
copy_df = df.copy()

In [None]:
#setting a custom index, for example one that starts from 1 instead of 0
#make sure the new index has the same length as the number of rows
copy_df.index = range(1, len(copy_df)+1)
copy_df.head(5)

In [None]:
# if you want to reset the index
copy_df = copy_df.reset_index(drop=True) #if you use drop=False the index will be a new column in your dataframe
copy_df.head()
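
The difference between `drop=True` and `drop=False` is easiest to see side by side. A minimal sketch on a made-up dataframe whose index starts at 100:

```python
import pandas as pd

toy = pd.DataFrame(
    {"names": ["Gustavo", "Henrik", "Wanja"]},
    index=[100, 101, 102],
)

# drop=True discards the old index and numbers the rows 0, 1, 2
dropped = toy.reset_index(drop=True)

# drop=False keeps the old index as a new column called "index"
kept = toy.reset_index(drop=False)

print(dropped)
print(kept)
```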

### Dropping rows and columns and renaming them
<div style="
    border: 2px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#let's delete a single column by name, for example the column "in stock"
df = df.drop(columns="in stock")

In [None]:
#if you want to delete columns by position, here the columns at positions 1 and 2
df = df.drop(df.columns[1:3], axis=1)

In [None]:
#if you want to drop several columns by name
columns_to_drop = ["base dimensions", "has slope?"]
df = df.drop(columns=columns_to_drop)

In [None]:
df.info()
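
Because these cells overwrite `df`, running the same cell twice raises a `KeyError`: the column is already gone. One possible way around this is the `errors="ignore"` option of `drop`, sketched here on a made-up dataframe:

```python
import pandas as pd

toy = pd.DataFrame({
    "in stock": [True, False],
    "base shape": ["square", "round"],
    "height": [1.0, 2.0],
})

# First run: the column exists and is removed
toy = toy.drop(columns="in stock")

# Second run: the column is already gone; errors="ignore" skips it
# instead of raising a KeyError
toy = toy.drop(columns="in stock", errors="ignore")

print(toy.columns.tolist())
```

The trade-off is that a misspelled column name is then skipped silently, so use this option with care.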

In [None]:
#dropping rows based on their position in the index, here the first 10 rows
df = df.drop(df.index[0:10], axis=0)
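
Dropping by position and dropping by label give different results as soon as the index no longer starts at 0. A small sketch on a made-up dataframe with an index running from 1 to 5:

```python
import pandas as pd

toy = pd.DataFrame(
    {"names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"]},
    index=range(1, 6),
)

# By position: toy.index[0:2] picks the labels of the first two rows
by_position = toy.drop(toy.index[0:2], axis=0)

# By label: removes the rows whose index labels are 1 and 5
by_label = toy.drop(index=[1, 5])

print(by_position.index.tolist())
print(by_label.index.tolist())
```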