# Data Analysis: Educational Lab

Welcome to the Data Analysis Lab! In this educational project, we aim to explore crucial concepts that form the foundation of data manipulation and analysis. This project is based on the <a href='https://www.coursera.org/learn/data-analysis-with-python'>Data Analysis with Python</a> course on the Coursera platform. The topic discussed here relates to the second week of the course and involves the practical application of concepts using a dataset collected from Kaggle: <a href='https://www.kaggle.com/datasets/shariful07/student-mental-health'>Student Mental health</a>.

By the end of this project, we will have acquired essential skills to handle missing data, adjust data types as needed, standardize and normalize relevant attributes, visualize data through <a href='#4'>grouped bar charts using binning</a>, and convert categorical data into numerical indicator variables.

We sincerely thank Coursera for providing not only a comprehensive and accessible educational platform but also for offering financial support (<a href='https://www.coursera.org/financial-aid'>financial aid</a>) that enables participation in this course. We express our deep gratitude to the community, whose sharing of knowledge and mutual support significantly enriches the learning experience for everyone.

## <a href='#1'>Handling Missing Data</a>
Robust data analysis requires the ability to effectively deal with missing data. There are various approaches to this challenge, from <a href='#1'>excluding incomplete entries</a> to <a href='#1'>imputing missing values</a>. We will explore these techniques, highlighting their applications and practical considerations.

## <a href='#2'>Data Type Correction</a>
Consistency in data types is crucial. In this lab, we will learn to <a href='#2'>identify and correct inappropriate data types</a>, ensuring that each variable is correctly interpreted. This is essential to avoid analysis errors and misinterpretation of data.

## <a href='#3'>Standardization and Normalization</a>
Standardizing and normalizing data are essential steps for <a href='#3'>comparing attributes on different scales</a>. We will understand the difference between these processes and when to apply each. In the end, we will have data ready for <a href='#3'>more accurate and meaningful analyses</a>.

## <a href='#4'>Visualization with Grouped Bar Charts (Binning)</a>
Visualization is a powerful tool in data analysis. We will cover the <a href='#4'>binning technique</a>, which groups data into intervals, allowing for a clearer, more interpretable representation. Get ready to create <a href='#4'>grouped bar charts</a> that reveal patterns and trends in your datasets.

## <a href='#5'>Categorical Data Conversion</a>
Not all algorithms can directly handle categorical data. We will learn to <a href='#5'>convert these variables into numerical indicators</a>, making them compatible with a wide range of analytical techniques. This step is vital to ensure that all aspects of your dataset contribute to robust analyses.

Throughout this project, we focus not only on performing tasks but also on understanding the reasons behind each step. Are we ready? Let's begin exploring and enhancing our data analysis skills!


# Importing the Necessary Libraries

Now, let's import the necessary libraries to start our project. Although a detailed explanation of this step is not included in this work, it is important to understand that importing libraries is a common practice in Python programming. To deepen this knowledge, we recommend exploring online resources such as the official Python documentation ([Python Documentation](https://docs.python.org/3/)) and specific tutorials available on learning platforms such as Coursera itself or sites like W3Schools. This skill will be valuable throughout the project and in future explorations in data analysis.


In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Reading the CSV File

Now, let's proceed with reading the CSV file stored locally. The path you see in the cell below corresponds to the file location on **my machine**. To continue and access the file on your machine, follow the steps below:

1. **Download the file:**
   - Visit the [Student Mental Health](https://www.kaggle.com/datasets/shariful07/student-mental-health) link on Kaggle.
   - Download the available CSV file on the page.
2. **Save the file on your machine:**
   - After downloading, save the file in an accessible location on your machine.
3. **Update the path in the code:**
   - If you saved the file in a different directory, adjust the file path in the code to reflect the correct location.

Alternatively, you can use the file I provided in the same project directory on GitHub. However, downloading the file from Kaggle is recommended as it provides additional information about the data that can be useful. This procedure is essential to ensure that the code works correctly and that you can explore the dataset effectively.
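One way to make a wrong path easier to diagnose is to check that the file exists before reading it. The sketch below is only illustrative: the helper name `load_csv` and the two-row stand-in file are assumptions made for this example and are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, failing early with a clear message if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"CSV not found at {path.resolve()}; check the path")
    return pd.read_csv(path)


# Demonstration with a tiny stand-in file (hypothetical data, not the Kaggle dataset)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text("Age,Choose your gender\n18,Female\n21,Male\n", encoding="utf-8")
    sample_df = load_csv(sample_path)

print(sample_df.shape)
```

With the real dataset, the same helper would be called with the location where you saved the downloaded CSV.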


In [10]:
# Path to the CSV file on the author's machine (replace it with your own path)
file = r'D:\Jeanco\Meus projetos\students_mental_health\Student Mental health.csv'

# Read the CSV file with the pandas library
df = pd.read_csv(file)

# Display the DataFrame to check that the CSV file was read correctly
df

Unnamed: 0,Timestamp,Choose your gender,Age,What is your course?,Your current year of Study,What is your CGPA?,Marital status,Do you have Depression?,Do you have Anxiety?,Do you have Panic attack?,Did you seek any specialist for a treatment?
0,8/7/2020 12:02,Female,18.0,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No
1,8/7/2020 12:04,Male,21.0,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No
2,8/7/2020 12:05,Male,19.0,BIT,Year 1,3.00 - 3.49,No,Yes,Yes,Yes,No
3,8/7/2020 12:06,Female,22.0,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No
4,8/7/2020 12:13,Male,23.0,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...
96,13/07/2020 19:56:49,Female,21.0,BCS,year 1,3.50 - 4.00,No,No,Yes,No,No
97,13/07/2020 21:21:42,Male,18.0,Engineering,Year 2,3.00 - 3.49,No,Yes,Yes,No,No
98,13/07/2020 21:22:56,Female,19.0,Nursing,Year 3,3.50 - 4.00,Yes,Yes,No,Yes,No
99,13/07/2020 21:23:57,Female,23.0,Pendidikan Islam,year 4,3.50 - 4.00,No,No,No,No,No


# Standardization of Column Naming

Before delving into the topics listed at the beginning of this project, we will adopt more consistent names for the dataset columns. This improves code readability and makes the data easier to manipulate in code. The new names are short English labels written in snake case.

## Adopted Conventions:

1. **Lowercase and Underscores (Snake Case):**
   - Example: `timestamp`, `gender`, `age`, `course`, etc.

## Why are we doing this?

By adopting this approach, we make column names more programming-friendly, making it easier to reference and manipulate data. Additionally, this standardization contributes to consistency in code and analysis, enhancing clarity and understanding of the dataset.

Let's apply these conventions as in the following example:

```python
# Example of column renaming
df.rename(columns={
    'Timestamp': 'timestamp',
    'Choose your gender': 'gender',
    'Age': 'age',
    'What is your course?': 'course',
    # ... (continue for other columns)
}, inplace=True)
df.head()
```


In [11]:
# Rename the columns
df.rename(columns={
    'Timestamp': 'timestamp',
    'Choose your gender': 'gender',
    'Age': 'age',
    'What is your course?': 'course',
    'Your current year of Study': 'current_study_year',
    'What is your CGPA?': 'cgpa',
    'Marital status': 'marital_status',
    'Do you have Depression?': 'have_depression',
    'Do you have Anxiety?': 'have_anxiety',
    'Do you have Panic attack?': 'have_panic_attack',
    'Did you seek any specialist for a treatment?': 'seek_specialist_for_treatment'
}, inplace=True)

# Check the first rows of the table
df.head()


Unnamed: 0,timestamp,gender,age,course,current_study_year,cgpa,marital_status,have_depression,have_anxiety,have_panic_attack,seek_specialist_for_treatment
0,8/7/2020 12:02,Female,18.0,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No
1,8/7/2020 12:04,Male,21.0,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No
2,8/7/2020 12:05,Male,19.0,BIT,Year 1,3.00 - 3.49,No,Yes,Yes,Yes,No
3,8/7/2020 12:06,Female,22.0,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No
4,8/7/2020 12:13,Male,23.0,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No
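For datasets with many columns, writing the mapping by hand can become tedious. As a minimal sketch, a small helper can derive snake case names mechanically. The function name `to_snake_case` and the toy DataFrame are assumptions made for this illustration; note that the result keeps every word of the original label (for example `do_you_have_anxiety`), unlike the shorter hand-picked names used above.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Lowercase a label and replace runs of non-alphanumeric characters with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


# Toy frame reusing two of the original column labels
toy = pd.DataFrame({"Marital status": ["No"], "Do you have Anxiety?": ["Yes"]})

# rename() accepts a function and applies it to every column label
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))  # ['marital_status', 'do_you_have_anxiety']
```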


# <a href="#1">1 - Handling Missing Data</a>

**Motivation:**
   - This step checks how much data is missing from the dataset, which accurate analyses depend on.
   - In a more complex dataset, identifying and addressing missing data is essential for reliable analyses.

**Implications:**
   - Missing data can impact the quality and validity of analyses.
   - Understanding the distribution of missing data is fundamental for choosing appropriate treatment strategies.

This analysis is crucial to ensure that we can address missing data in an informed and effective manner in our dataset.

Next, we will check the data for missing values and then treat them, with a practical explanation of why handling these gaps matters.


In [12]:
dados_ausentes = df.isnull()
for coluna in dados_ausentes.columns.values.tolist():
    print(dados_ausentes[coluna].value_counts())
    print("")
df.columns.isnull()

timestamp
False    101
Name: count, dtype: int64

gender
False    101
Name: count, dtype: int64

age
False    100
True       1
Name: count, dtype: int64

course
False    101
Name: count, dtype: int64

current_study_year
False    101
Name: count, dtype: int64

cgpa
False    101
Name: count, dtype: int64

marital_status
False    101
Name: count, dtype: int64

have_depression
False    101
Name: count, dtype: int64

have_anxiety
False    101
Name: count, dtype: int64

have_panic_attack
False    101
Name: count, dtype: int64

seek_specialist_for_treatment
False    101
Name: count, dtype: int64



array([False, False, False, False, False, False, False, False, False,
       False, False])

Let's address the issue of missing data, considering that our dataset is simple for didactic purposes.

Below is a step-by-step explanation of the code used above to detect missing data:

1. **`dados_ausentes = df.isnull()`:**
   - We create a boolean DataFrame (`dados_ausentes`, "missing data") holding True for missing values and False for present values in `df`.
2. **`for coluna in dados_ausentes.columns.values.tolist():`:**
   - We iterate through each column (`coluna`) of that boolean DataFrame.
3. **`print(dados_ausentes[coluna].value_counts())`:**
   - We count and display the quantity of missing (True) and present (False) values in each column.
4. **`df.columns.isnull()`:**
   - We check whether any column label in the original DataFrame is missing, that is, whether any column name is empty. This is useful when the integrity of the column labels needs to be verified. Because it is the last expression in the cell, the notebook displays its result: the array of `False` values at the end of the output.
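The same per-column counts can be obtained more compactly by summing the boolean DataFrame, since each True counts as 1. The sketch below uses a small made-up DataFrame (the values are hypothetical) rather than the lab's dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing age
toy = pd.DataFrame({"age": [18.0, 21.0, np.nan], "gender": ["Female", "Male", "Male"]})

# Summing the boolean frame gives the number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

This gives one number per column instead of a separate True/False table for each, which is easier to scan when there are many columns.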


Now let's explain the outputs provided by our code. Let's take an example:

```
timestamp
False    101
Name: count, dtype: int64
```

Notice the column name `timestamp` on the first line. The number next to `False` is the count of non-null values: here, all 101 values in the column are present.

Now, let's examine the output for the `age` column:

```
age
False    100
True       1
Name: count, dtype: int64
```

- **`False: 100`:** Indicates that there are 100 non-null values in the `age` column.
- **`True: 1`:** Indicates that there is one null (missing) value in the `age` column.

The remaining line of each output (`Name: count, dtype: int64`) is metadata about the counts themselves and is not needed within the scope of this project.

But what is stored in place of the missing value in the `age` column? Let's use the `unique()` method to list the distinct values in the column and find out.

In [13]:
print(df['age'].unique())

[18. 21. 19. 22. 23. 20. 24. nan]


The output above indicates that in the age column, we have values like 18 years, 21 years, and so on. Note the last value: nan.

Now we've identified that our null value is represented by "nan" (not a number), allowing us to confidently proceed with handling this missing data. The consistent use of "nan" as the default representation brings benefits in terms of consistency and compatibility in numerical operations. It's not necessary to deeply understand these details; the key point is that using "nan" as the missing value representation is preferable to other representations.

Understanding that we are dealing with "nan" allows us to employ specific strategies, such as substitution with the mean or most frequent value, effectively.

**Importance of Knowing the Value in Null:**

Knowing the specific value that occupies the place of null is crucial for making informed decisions on how to handle it. For example, if the null value were represented by a specific character, like "?", it would be necessary to identify and treat that character uniquely, avoiding ambiguities in numerical operations and ensuring coherence in the data.

In summary, the choice of null value representation directly impacts how we deal with these values during data analysis, and opting for "nan" provides a standardized and efficient approach.
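As a sketch of the "?" scenario described above, a placeholder character can be converted to "nan" before any numerical work. The column below is made up for illustration; the lab's dataset already uses "nan", so this step is not needed there.

```python
import numpy as np
import pandas as pd

# Hypothetical column where a missing age was recorded as "?"
toy = pd.DataFrame({"age": ["18", "?", "21"]})

# Swap the placeholder for NaN, then convert the text values to numbers
toy["age"] = pd.to_numeric(toy["age"].replace("?", np.nan))

print(toy["age"].isnull().sum())  # number of missing values after the conversion
```

Once the placeholder is gone, the column has a numeric type and the usual tools such as `isnull()` and `mean()` behave as expected.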

# <a href="#1.1">1.1 - Replacement of Null Values with Mean</a>

When an attribute holds continuous numeric data, a common approach is to replace missing values with the mean. In our dataset, the `age` variable is treated as continuous, with values like 18, 21, 19, and so on. If any of these values are missing, we can fill them with the mean of the variable.

Continuous data refers to variables that can take any value within a specific range, such as age, which can be represented by real numbers. In contrast to discrete data, which takes specific values and is often integers, continuous data can include fractions or decimal numbers.

Using the code below, we replace the missing values in the `age` column with the mean of that same column:
```python
# Example code for replacing missing values with the mean in the `age` column
mean_age = df['age'].mean()
# Assign the result back instead of using inplace=True on a selected column
df['age'] = df['age'].fillna(mean_age)
```


In [14]:
media_idade = df['age'].astype('float').mean(axis=0)
df['age'] = df['age'].replace(np.nan, media_idade)
print(df['age'].unique())

[18.   21.   19.   22.   23.   20.   24.   20.53]


Notice that now we have the value 20.53 instead of 'nan'. This value is the mean of all the ages presented in the column.

Replacing with the mean keeps the column's mean unchanged, although it slightly reduces the spread of the values.
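To see what mean replacement does to a column's summary statistics, here is a minimal sketch on made-up ages (the values are illustrative and not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([18.0, 20.0, 22.0, np.nan])

# mean() skips NaN by default, so it is the mean of the three known ages
filled = ages.fillna(ages.mean())

print(ages.mean(), filled.mean())  # the mean is the same before and after
print(ages.std(), filled.std())    # the standard deviation is smaller after filling
```

The filled value sits exactly at the center of the data, which is why the mean is preserved while the standard deviation shrinks.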


# <a href="#1.2">1.2 - Replacement of Null Values with Most Frequent Value</a>

### Example

When dealing with attributes that contain categorical data, the recommended approach for replacing missing values is to use the most frequent value in the category. Let's consider a scenario where we have a categorical column, for example, the `marital_status` column, and some values are missing. The appropriate strategy would be to replace these values with the most frequent marital status.

Categorical data refers to variables that represent discrete categories, such as "Single," "Married," or "Divorced." When encountering missing values in this type of attribute, it is beneficial to fill them with the most common value in the category, maintaining the integrity of the distribution.

Below is a generic code that exemplifies how to replace missing values with the most frequent value in the categorical column:

```python
most_frequent_marital_status = df['marital_status'].value_counts().idxmax()
# Assign the result back instead of using inplace=True on a selected column
df['marital_status'] = df['marital_status'].replace(np.nan, most_frequent_marital_status)
```
This approach ensures that missing values are filled in a representative manner, aiming to maintain cohesion in the distribution of categorical data.
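The same idea can be tried on a small made-up Series. This sketch swaps in `mode()` and `fillna()` as an alternative to `value_counts().idxmax()` and `replace()`; the answers below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical answers with one missing entry
status = pd.Series(["No", "No", "Yes", np.nan])

# mode() ignores NaN; [0] takes the first value in case of a tie
most_frequent = status.mode()[0]
filled = status.fillna(most_frequent)

print(filled.tolist())  # ['No', 'No', 'Yes', 'No']
```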