# Import the libraries #

In [None]:
import numpy as np
import pandas as pd

## Why do we use **as**? ##

In Python, the keyword `as` creates an alias (a shorter, alternative name) for a module or function at import time. The alias is then used within the current scope (usually a script or notebook) to refer to the imported module or function more conveniently.

## Why use aliases? ##
1. **Simplification**: Aliases allow for shorter and easier-to-remember names compared to the full module path.
2. **Readability**: They improve code readability, especially for modules with long names (e.g., `matplotlib.pyplot` becomes the more concise `plt`).
3. **Convention**: Common aliases such as `np` and `pd` are standard in the data science community, so using them makes code more consistent and easier to read for others.

Once you create an alias, you can access any function within the imported module using the alias followed by a dot and the function name:
```python
alias_name.function_name()
```

### Example ###
```python
import seaborn as sns  # sns is an alias for seaborn (a plotting library built on top of matplotlib)
sns.set_theme(style="ticks")  # set the plot theme to `ticks`
```

# Import the dataset #

In [3]:
df = pd.read_csv('datasets/Employee.csv')

## What is a df? ##

`df` is short for `DataFrame`, the main data structure in pandas. It organizes data into rows and columns, similar to a spreadsheet, and each column can hold its own data type (numbers, text, categories, etc.), which makes it well suited to complex data analysis.

## Why do we type `df` instead of `dataframe`? ##
1. **Brevity**: Typing df is shorter and quicker.
2. **Convention**: It's a widely adopted convention in the data science community, making code more readable for others.

> Note: In this case, the data is stored in a CSV (comma-separated values) file. However, pandas can also read other formats such as XML, JSON, HDF, and SQL.
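
As a rough illustration of the note above, the sketch below reads the same made-up records from CSV text and from JSON text. The sample data and variable names are assumptions for illustration only; they are not taken from `Employee.csv`.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from Employee.csv
csv_text = "City,Age\nPune,30\nBangalore,41\n"
json_text = '[{"City": "Pune", "Age": 30}, {"City": "Bangalore", "Age": 41}]'

# Each reader parses its own format, but both return a DataFrame
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Whatever the source format, the rest of the analysis works on the resulting dataframe in the same way.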

# Know your data #
We have imported the necessary libraries, read the data, and stored it in a variable. **What's next?** Now we need to understand what we are dealing with.

Many datasets are too large to inspect by eye, so a useful starting point is the `columns` attribute of the dataframe. It returns the column names.

In [11]:
df.columns

Index(['Education', 'JoiningYear', 'City', 'PaymentTier', 'Age', 'Gender',
       'EverBenched', 'ExperienceInCurrentDomain', 'LeaveOrNot'],
      dtype='object')
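
If the `Index(...)` representation is awkward to scan, one option is to convert it to a plain Python list. This is a minimal sketch on a made-up miniature frame (the values are assumptions, only the column names echo the dataset):

```python
import pandas as pd

# Hypothetical miniature frame standing in for the employee data
df = pd.DataFrame({"Education": ["Bachelors"], "City": ["Pune"], "Age": [30]})

column_names = df.columns.tolist()  # plain Python list of column names
print(column_names)

# One name per line, together with its position
for position, name in enumerate(column_names):
    print(position, name)
```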

While `df.columns` works, its output is a little hard to read. And even though we now know the column names, we do not know what each column contains. So first, we are going to list the column names together with their types. One naive approach is to iterate over each column and check its type.

In [23]:
for column in df.columns:
    print(f'{column}\t{type(column)}')

Education	<class 'str'>
JoiningYear	<class 'str'>
City	<class 'str'>
PaymentTier	<class 'str'>
Age	<class 'str'>
Gender	<class 'str'>
EverBenched	<class 'str'>
ExperienceInCurrentDomain	<class 'str'>
LeaveOrNot	<class 'str'>


However, this approach does not tell us the data types, because `type(column)` checks the type of the column name, not of the data within the column. All column names here are strings, so the output shows `<class 'str'>` for each column.

## Accessing a Column in `df` ##
What if we index `df` with a column name?

In [19]:
df['Education']

0       Bachelors
1       Bachelors
2       Bachelors
3         Masters
4         Masters
          ...    
4648    Bachelors
4649      Masters
4650      Masters
4651    Bachelors
4652    Bachelors
Name: Education, Length: 4653, dtype: object

This returns the column as a `Series`. So, what's a `Series`? According to the official documentation:

> One-dimensional ndarray with axis labels (including time series).

In simple words, it is a one-dimensional array in which every element carries a label (the index shown on the left). The output above shows some of the data in the `Education` column (the first 5 and the last 5 rows). So, what if we access the first element, so that we can check its type?

In [22]:
df['Education'][0]

'Bachelors'
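
`df['Education'][0]` works here because the default index labels the rows 0, 1, 2, and so on, so label 0 is also the first position. The sketch below uses made-up values and a custom index (both assumptions) to show how label-based and position-based access can differ:

```python
import pandas as pd

# Hypothetical values; the custom index makes labels and positions differ
education = pd.Series(
    ["Bachelors", "Masters", "Masters"], index=[10, 20, 30], name="Education"
)

first_by_position = education.iloc[0]  # first row, whatever its label is
first_by_label = education[10]         # row whose label is 10

print(first_by_position, type(first_by_position))
print(first_by_label)
```

When the goal is "the first element", `.iloc[0]` is the safer choice because it does not depend on how the rows are labeled.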

Now let's do this for every column and print the type of its first element:



In [27]:
for column in df.columns:
    print(f'{column}\t', type(df[column][0]))

Education	 <class 'str'>
JoiningYear	 <class 'numpy.int64'>
City	 <class 'str'>
PaymentTier	 <class 'numpy.int64'>
Age	 <class 'numpy.int64'>
Gender	 <class 'str'>
EverBenched	 <class 'str'>
ExperienceInCurrentDomain	 <class 'numpy.int64'>
LeaveOrNot	 <class 'numpy.int64'>


This works; however, we now face another problem: what if the data type changes at some point?

Let me explain. In the code above, we check only the first element, hoping that the rest of the elements have the same type. But what if some element in the `Education` column is a number, or a column contains only numbers except for a first element of a different type? Inconsistent entries like these are a data-quality problem, related to the outliers that we will explore in detail later.
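
To see how checking only the first element can mislead, here is a small sketch. The column values are made up for illustration; nothing suggests the real dataset contains such an entry.

```python
import pandas as pd

# Hypothetical column where one entry has the wrong type
education = pd.Series(["Bachelors", "Masters", 3, "Bachelors"])

first_type = type(education.iloc[0])
all_types = {type(value) for value in education}

print(first_type)  # only reflects the first element
print(all_types)   # reveals every type present in the column
```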

We need a more thorough way to check the data types. We could iterate over every element in each column and check its type.

In [51]:
types = {}
for column in df.columns:
    for element in df[column]:
        element_type = type(element)
        if column not in types:
            types[column] = [element_type]
        elif element_type not in types[column]:
            types[column].append(element_type)
for key, value in types.items():
    print(f'{key}: {value}')

Education: [<class 'str'>]
JoiningYear: [<class 'int'>]
City: [<class 'str'>]
PaymentTier: [<class 'int'>]
Age: [<class 'int'>]
Gender: [<class 'str'>]
EverBenched: [<class 'str'>]
ExperienceInCurrentDomain: [<class 'int'>]
LeaveOrNot: [<class 'int'>]


First, we define a dictionary. Then we iterate over every element in every column and look at its type. If the column is not yet in the dictionary, we add a new key (the column name) whose value is a list containing the type of the current element. If the column is already there, we append the type to its list only when it has not been recorded before. Note that iterating over a numeric `Series` yields plain Python values, which is why the output shows `int` rather than the `numpy.int64` seen earlier.
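
The same bookkeeping can be written more compactly with a set per column, since a set ignores duplicates on its own. This is only a sketch on a made-up miniature frame, in which `Education` deliberately contains a stray number:

```python
import pandas as pd

# Hypothetical miniature frame; "Education" deliberately contains a stray number
df = pd.DataFrame({
    "Education": ["Bachelors", 3, "Masters"],
    "Age": [30, 41, 25],
})

# For each column, collect the set of distinct element types
types = {column: {type(element) for element in df[column]} for column in df.columns}

for column, found in types.items():
    print(f"{column}: {sorted(t.__name__ for t in found)}")
```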

Here is a quick recap of dictionaries:

```python
# Dictionaries in Python work like a hash map: they store items as `key/value` pairs.
this_is_a_dictionary = {
    "Key1": "Value1",
    "Key2": False,
    "Key3": 1,
    "Key4": [1, False, "dict"],
    1: 'dict'
}
print(this_is_a_dictionary) # Output: {'Key1': 'Value1', 'Key2': False, 'Key3': 1, 'Key4': [1, False, 'dict'], 1: 'dict'}

# Accessing elements by key, similar to array indexing
print(this_is_a_dictionary['Key1']) # Output: Value1
print(this_is_a_dictionary[1]) # Output: dict
```

The manual method of checking data types works, but it is slow and verbose. Fortunately, pandas provides a built-in attribute called `dtypes` that does this job for us. According to the official documentation:

> This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype.

In simpler terms, `dtypes` gives a per-column summary similar to the manual method, reporting `object` for text columns (as in the output below) and for columns with mixed types.

In [52]:
df.dtypes

Education                    object
JoiningYear                   int64
City                         object
PaymentTier                   int64
Age                           int64
Gender                       object
EverBenched                  object
ExperienceInCurrentDomain     int64
LeaveOrNot                    int64
dtype: object
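
To illustrate the mixed-type case from the documentation quote, the sketch below builds a made-up miniature frame (all values are assumptions) with one column that mixes text and numbers. It also uses `select_dtypes` to pick out the numeric columns.

```python
import pandas as pd

# Hypothetical miniature frame, not the employee dataset
df = pd.DataFrame({
    "City": ["Pune", "Bangalore"],
    "Age": [30, 41],
    "Mixed": ["text", 7],  # two different Python types in one column
})

print(df.dtypes)

# Keep only the numeric columns
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

An `object` dtype on a column that should be numeric is often a hint that some entries need cleaning.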

In [4]:
df.describe()

       JoiningYear  PaymentTier        Age  ExperienceInCurrentDomain  LeaveOrNot
count       4653.0       4653.0     4653.0                     4653.0      4653.0
mean    2015.06297     2.698259  29.393295                   2.905652    0.343864
std       1.863377     0.561435   4.826087                    1.55824    0.475047
min         2012.0          1.0       22.0                        0.0         0.0
25%         2013.0          3.0       26.0                        2.0         0.0
50%         2015.0          3.0       28.0                        3.0         0.0
75%         2017.0          3.0       32.0                        4.0         1.0
max         2018.0          3.0       41.0                        7.0         1.0
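
`describe()` summarizes each numeric column with its count, mean, standard deviation, minimum, quartiles, and maximum, which is why the text columns are missing from the table above. As a sketch on made-up values (the frame below is an assumption, not the employee data), passing `include="all"` adds the non-numeric columns as well:

```python
import pandas as pd

# Hypothetical miniature frame, not the employee dataset
df = pd.DataFrame({
    "Education": ["Bachelors", "Masters", "Bachelors", "Bachelors"],
    "Age": [24, 31, 28, 35],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique / top / freq for text columns

print(numeric_summary)
print(full_summary)
```

For text columns the summary reports the number of distinct values (`unique`), the most frequent value (`top`), and how often it occurs (`freq`).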


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 327.3+ KB
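
In the output above, every column reports 4653 non-null values, which matches the 4653 entries, so no values are missing. When the counts do not match, a per-column count of missing values is a quick follow-up. This is a minimal sketch on a made-up frame with one missing city (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical miniature frame with one missing city
df = pd.DataFrame({
    "City": ["Pune", None, "Bangalore"],
    "Age": [30, 41, 25],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```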
