# Exploratory Data Analysis with Python
<div style="
    border: 5px solid purple;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
import pandas as pd

<div style="
    border: 3px solid purple;
    border-radius: 8px;
    padding: 12px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
     Your job is to make sense of any given dataset and produce a preliminary report.
    <ul>
      <li>What is the structure of the data?</li>
      <li>How clean is the dataset?</li>
      <li>Does it look real, or was it machine generated?</li>
      <li>Is it worth analysing further?</li>
      <li>Are there any interesting insights that can already be drawn?</li>
    </ul>
</div>

## The basics - Understanding a dataframe
<div style="
    border: 4px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

<div style="
    border: 3px solid orange;
    border-radius: 8px;
    padding: 12px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
A dataframe is "two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure."
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
</div>

### Building a dataframe from a dictionary
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
mydict = {
    "names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"],
    "scores": [39, 34, 40, 49, 10],
    "fav_food": ["tacos", "pasta", "cake", "döner", "ice cream"]
}

In [None]:
#creating the dataframe with the pandas library
df = pd.DataFrame(mydict)

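As a minimal, self-contained sketch of what the constructor does with the dictionary: each key becomes a column label, and each list position becomes a row under the default integer index. The name `df_demo` is only used here so the example does not overwrite `df`.

```python
import pandas as pd

# Same toy data as in the cell above
mydict = {
    "names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"],
    "scores": [39, 34, 40, 49, 10],
    "fav_food": ["tacos", "pasta", "cake", "döner", "ice cream"],
}

df_demo = pd.DataFrame(mydict)

# Keys become column labels; list positions become the default integer index
print(df_demo.columns.tolist())  # ['names', 'scores', 'fav_food']
print(df_demo.index.tolist())    # [0, 1, 2, 3, 4]
print(df_demo.shape)             # (5, 3)
```

All lists in the dictionary must have the same length, otherwise the constructor raises a `ValueError`.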
### Importing data
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#from a csv file
df = pd.read_csv("datasets/socialmedia_engagement.csv")

In [None]:
#from an Excel file (requires the openpyxl package, e.g. pip install openpyxl)
df = pd.read_excel("datasets/happiness_2015-2019.xlsx")

In [None]:
#from github
username = "datagus"
repository = "statstutorial2025"
directory = "week5/airbnb_europe.csv"
github_url = f"https://raw.githubusercontent.com/{username}/{repository}/main/{directory}"
df = pd.read_csv(github_url)

In [None]:
#from a google spreadsheet
gsheet_id = "1wEGvOk504_wnFlv1D9Dw8IFIAaDMtwau"
url = f"https://docs.google.com/spreadsheets/d/{gsheet_id}/export?format=xlsx"
excel = pd.ExcelFile(url)
df = excel.parse("master table")

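The loaders above depend on local files or a network connection. To try `pd.read_csv` without either, a possible sketch is to feed it an in-memory text buffer. The CSV content and column names below are made up for illustration and are not taken from the course datasets.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only so the example runs without the course files
csv_text = """platform,likes,shares
instagram,120,14
tiktok,300,52
facebook,45,3
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)  # (3, 3)
print(demo.dtypes)  # likes and shares are parsed as integers
```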
### Inspecting the structure of a dataframe
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#how many columns and rows


In [None]:
#retrieving them separately and printing the number of columns and rows


In [None]:
#another way of getting the number of rows, using len()


In [None]:
#a more detailed overview, an information overview


In [None]:
pd.set_option('display.max_columns', None) # to show all columns
#pd.reset_option('display.max_columns')
#checking the first 5 rows


In [None]:
#checking the last 5 rows


In [None]:
#checking a random slice of the dataframe


In [None]:
#checking the column names


In [None]:
#getting the index

In [None]:
#getting some descriptive statistics for numeric columns


In [None]:
#getting some descriptive statistics for categories or object data types

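One possible way to fill in the inspection steps above, shown on a small toy dataframe rather than on the course data. The choice of `describe(exclude="number")` for the non-numeric summary is an assumption made here: text columns may be stored as `object` or as a dedicated string dtype depending on the pandas version, and excluding numbers covers both.

```python
import pandas as pd

demo = pd.DataFrame({
    "names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"],
    "scores": [39, 34, 40, 49, 10],
    "fav_food": ["tacos", "pasta", "cake", "döner", "ice cream"],
})

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 5 3
print(len(demo))             # 5, len() counts rows

demo.info()                  # prints dtypes, non-null counts and memory usage

print(demo.head())                     # first 5 rows by default
print(demo.tail(2))                    # last 2 rows
print(demo.sample(3, random_state=0))  # 3 random rows, reproducible via the seed

print(demo.columns)  # column labels
print(demo.index)    # row labels

num_stats = demo.describe()                  # numeric columns only
cat_stats = demo.describe(exclude="number")  # count, unique, top, freq
print(num_stats)
print(cat_stats)
```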

### Quality of the dataframe
<div style="
    border: 2px solid orange;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#checking duplicates in the dataframe


In [None]:
#how many missing values


In [None]:
#checking the datatypes

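The three quality checks above could look like the following sketch. The toy dataframe and its column names are hypothetical and were built to contain exactly one repeated row and two missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one repeated row and two missing values
demo = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "city": ["Berlin", "Lima", "Lima", None, "Oslo"],
    "score": [3.5, 4.0, 4.0, 2.0, np.nan],
})

# duplicated() flags rows that repeat an earlier row in every column
n_duplicates = demo.duplicated().sum()
print(n_duplicates)  # 1

# isna() marks missing cells; summing gives the count per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# dtypes shows how pandas stores each column
print(demo.dtypes)
```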

In [None]:
#which SDGs do we have?


In [None]:
#how many SDGs do we have?


In [None]:
#getting a contingency table of SDGs


In [None]:
#getting a contingency table of SDGs


In [None]:
#another variable to check?


In [None]:
#saving a contingency table into a variable for "location of the study"


In [None]:
#converting the object into a DataFrame


In [None]:
#resetting the index

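A sketch of the counting steps above, assuming the tables in question are per-category frequency counts. The columns `sdg` and `location` and the label `n_studies` are hypothetical stand-ins, since the real column names are not shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the dataset referred to in the comments above
demo = pd.DataFrame({
    "sdg": [5, 10, 5, 13, 5, 10],
    "location": ["Peru", "Kenya", "Peru", "India", "Kenya", "Peru"],
})

print(demo["sdg"].unique())   # distinct values, in order of first appearance
print(demo["sdg"].nunique())  # 3

# Frequency table: index = category, values = number of rows
counts = demo["location"].value_counts()
# Same table as proportions that sum to 1
shares = demo["location"].value_counts(normalize=True)

# value_counts returns a Series; to_frame plus reset_index turns it
# into a regular two-column DataFrame
counts_df = counts.to_frame(name="n_studies").reset_index()
print(counts_df)
```

The label of the category column after `reset_index()` differs between pandas versions, so it is safer to rename it explicitly before relying on it.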

## Dataframe Operations
<div style="
    border: 4px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

### Modifying the index
<div style="
    border: 2px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#making a copy of your dataset; recommended before modifying it, so the original df stays intact
copy_df = df.copy()

In [None]:
# setting a custom index, for example one starting from 100
# make sure the index has as many entries as the dataframe has rows


In [None]:
# if you want to reset the index

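The index steps above might be written as follows. This is a sketch on a three-row toy dataframe; `demo` and `demo_copy` are names chosen for this example only.

```python
import pandas as pd

demo = pd.DataFrame({
    "names": ["Gustavo", "Henrik", "Wanja"],
    "scores": [39, 34, 40],
})
demo_copy = demo.copy()  # work on a copy so that demo stays untouched

# Custom index starting at 100; its length must equal the number of rows
demo_copy.index = range(100, 100 + len(demo_copy))
custom_index = demo_copy.index.tolist()
print(custom_index)  # [100, 101, 102]

# Back to the default 0..n-1 index; drop=True discards the old index
# instead of keeping it as a new column
demo_copy = demo_copy.reset_index(drop=True)
print(demo_copy.index.tolist())  # [0, 1, 2]
```

Assigning an index of the wrong length raises a `ValueError`, which is why the range is computed from `len(demo_copy)`.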

### Dropping rows and columns and renaming them
<div style="
    border: 2px solid green;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#let's delete some columns, for example the first column


In [None]:
# if you want to delete some columns by their positions


In [None]:
# if you want to drop several columns


In [None]:
#checking duplicated rows based on a column, for example EID


In [None]:
#which rows are duplicated in the EID column


In [None]:
# dropping duplicates based on a specific column


In [None]:
# checking missing values from the abstract column


In [None]:
# dropping missing values from the abstract column

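A possible version of the dropping steps above. The dataframe is a hypothetical stand-in: `EID` and `abstract` mirror the column names mentioned in the comments, while `Unnamed: 0` and `title` are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "EID": ["a1", "a2", "a2", "a3"],
    "title": ["T1", "T2", "T2 (copy)", "T3"],
    "abstract": ["text", "text", "text", None],
})

# Drop one column by name; drop returns a new DataFrame
no_first = demo.drop(columns="Unnamed: 0")

# Drop by position: look up the labels at positions 0 and 2 first
by_position = demo.drop(columns=demo.columns[[0, 2]])

# Count rows whose EID repeats an earlier row
print(demo.duplicated(subset="EID").sum())  # 1

# keep=False flags every member of a duplicated group, not only the repeats
print(demo[demo.duplicated(subset="EID", keep=False)])

# Keep only the first row for each EID
deduped = demo.drop_duplicates(subset="EID")

# Count and then drop rows with a missing abstract
print(demo["abstract"].isna().sum())  # 1
cleaned = deduped.dropna(subset=["abstract"])
```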

In [None]:
#renaming columns


In [None]:
#you can also rename the column based on the position


In [None]:
#you can rename several columns at once

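The three renaming variants above can be sketched with `DataFrame.rename`. The new labels (`student`, `points`, `favourite_food`) are arbitrary examples.

```python
import pandas as pd

demo = pd.DataFrame({
    "names": ["Gustavo"],
    "scores": [39],
    "fav_food": ["tacos"],
})

# Rename one column by label; rename returns a new DataFrame
renamed = demo.rename(columns={"names": "student"})

# Rename by position: look up the current label at that position first
by_position = demo.rename(columns={demo.columns[1]: "points"})

# Rename several columns at once with one mapping
several = demo.rename(columns={"scores": "points", "fav_food": "favourite_food"})

print(several.columns.tolist())  # ['names', 'points', 'favourite_food']
```

Labels that are missing from the mapping are left unchanged, and `demo` itself is not modified unless the result is assigned back to it.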

## Index and Slicing
<div style="
    border: 4px solid blue;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>

In [None]:
#selecting a column


In [None]:
#use double square brackets to get the result as a DataFrame instead of a Series


In [None]:
#selecting rows by index position


In [None]:
# or use double square brackets


In [None]:
#selecting the first 20 rows with all columns


In [None]:
#selecting the first 10 rows with the first five columns


In [None]:
#select the last 10 rows with the columns 3 to 8


In [None]:
#selecting rows by label and index

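The selection patterns above, sketched on a five-row toy dataframe. The letter index is an assumption added here to make the difference between position-based `iloc` and label-based `loc` visible.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "names": ["Gustavo", "Henrik", "Wanja", "Carlo", "Jannik"],
        "scores": [39, 34, 40, 49, 10],
        "fav_food": ["tacos", "pasta", "cake", "döner", "ice cream"],
    },
    index=["a", "b", "c", "d", "e"],  # hypothetical labels to contrast loc with iloc
)

col_series = demo["scores"]    # single brackets give a Series
col_frame = demo[["scores"]]   # double brackets give a one-column DataFrame

row_series = demo.iloc[0]      # one row by position, returned as a Series
row_frame = demo.iloc[[0]]     # list of positions, returned as a DataFrame

first_rows = demo.iloc[:3, :2]   # first 3 rows, first 2 columns (end is exclusive)
last_rows = demo.iloc[-2:, 1:3]  # last 2 rows, columns at positions 1 and 2

# loc works on labels, and label slices include the end label
by_label = demo.loc["b":"d", ["names", "scores"]]
print(by_label)
```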

In [None]:
# conditional selection, for example all articles with more than 40 citations


In [None]:
# transforming SDG column to object


In [None]:
# selecting only articles from SDG 10 and 5

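The conditional selections above could be written with boolean masks and `isin`. The columns `title`, `citations` and `sdg` are hypothetical stand-ins for the real dataset, and the conversion uses `astype("object")` on the assumption that this is what the comment about transforming the SDG column refers to.

```python
import pandas as pd

# Hypothetical stand-in for the article dataset
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "citations": [12, 55, 41, 40],
    "sdg": [5, 10, 13, 5],
})

# A comparison yields a boolean Series; indexing with it keeps the True rows
highly_cited = demo[demo["citations"] > 40]  # strictly more than 40

# Store the SDG codes as generic Python objects so they are treated as
# categories rather than numbers; the values themselves stay integers
demo["sdg"] = demo["sdg"].astype("object")

# isin keeps rows whose value matches any entry in the list
selected = demo[demo["sdg"].isin([5, 10])]
print(selected)
```

Several conditions can be combined with `&` and `|`, with each condition wrapped in its own parentheses.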
## Exercise
Explore the following dataset: datasets/socialmedia_engagement.csv

Does it look real or machine generated?
<div style="
    border: 4px solid red;
    border-radius: 8px;
    padding: 0px;
    margin: 10px 0;
    background-color: inherit;
    color: inherit;
">
</div>