# Data Cleaning

When I downloaded this data, I opened it in Excel and realized at first glance that the dataset needed some cleaning.

1. Before manually fixing any points, I used the duplicate function in Excel to make sure there were no duplicated data points.
2. I made the courses and year of study columns all uppercase to eliminate confusion.
3. Next, I made the age column numerical.
4. Then, I used pivot tables to check each column for misspelled words. For example, if "Female" had been entered as "Femela", it would show up as a separate category with a much lower frequency. Thankfully, there were no misspelled words in the whole table. However, one age was missing. I fixed this by assigning it the average age of the rest of the students.
5. Lastly, I changed the names of the columns because I knew that the SQL IDE would not be a fan of the original column names and these new ones would be easier for me to use during analysis.
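
The same five steps could also be scripted rather than done by hand. The following is a minimal pandas sketch, not what I actually ran in Excel. The raw column names and the rows are hypothetical stand-ins, and rounding the mean age to a whole number is an assumption.

```python
import pandas as pd

# Hypothetical toy rows standing in for the raw export; names and values are made up.
raw = pd.DataFrame({
    "Choose your gender": ["Female", "Male", "Male", "Female"],
    "Age": ["18", "22", None, "18"],
    "What is your course?": ["Engineering", "engineering", "Law", "Engineering"],
    "Your current year of Study": ["year 1", "Year 2", "year 3", "year 1"],
})

# 1. Drop exact duplicate rows before fixing anything by hand.
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. Upper-case the course and year-of-study columns.
for col in ["What is your course?", "Your current year of Study"]:
    clean[col] = clean[col].str.upper()

# 3. Make the age column numerical (unparseable or missing values become NaN).
clean["Age"] = pd.to_numeric(clean["Age"], errors="coerce")

# 4. Frequency table per category, the pandas analogue of the pivot-table check:
#    a misspelling would appear as a rare extra category.
gender_counts = clean["Choose your gender"].value_counts()

#    Fill the missing age with the mean of the other ages (rounded: an assumption).
mean_age = clean["Age"].mean()
clean["Age"] = clean["Age"].fillna(round(mean_age)).astype(int)

# 5. Rename columns to short, SQL-friendly names.
clean = clean.rename(columns={
    "Choose your gender": "gender",
    "Age": "age",
    "What is your course?": "course",
    "Your current year of Study": "study_year",
})

print(gender_counts)
print(clean)
```

Scripting the steps makes the cleaning repeatable if the survey data is downloaded again.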

**Before Cleaning**
![image.png](attachment:af203217-71b7-4260-a7eb-b78bc259d4c3.png)

**After Cleaning**
![image.png](attachment:b31bc2e2-6c2c-4371-b296-796d06a57303.png)

# SQL Analysis
This project was originally published as a curated project for SQL students, but after analysis I felt that I could dig deeper with Python or R using visuals. This first portion is my SQL analysis, including both queries and output.

## Introduction
This data set contains university students' answers to survey questions about their demographics, current educational status, and mental health. We are asked to look for trends related to mental health.

![image.png](attachment:c01e9cdc-36ce-46e9-91dd-af19e59b5412.png)

I wrote the query to return a limited sample from the middle of the dataset, so the output would not be overwhelming:

    SELECT * FROM STM
    LIMIT 10 OFFSET 10

We can see there is a Timestamp column. It has nothing to do with the students' mental health or their educational status, so we can drop it:

    ALTER TABLE stm
    DROP COLUMN Timestamp

Then, we can pick a column to group against CGPA and count the students in each group with COUNT(CGPA).
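
As a rough illustration of that kind of query, here is a small sketch that runs a `GROUP BY` with `COUNT(CGPA)` against an in-memory SQLite table from Python. The rows are made up, the choice of `gender` as the grouping column is only an example, and the alias `n_students` is a name introduced for this sketch.

```python
import sqlite3

# In-memory database with a tiny, hypothetical stand-in for the stm table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stm (gender TEXT, CGPA TEXT)")
conn.executemany(
    "INSERT INTO stm VALUES (?, ?)",
    [
        ("Female", "3.00 - 3.49"),
        ("Female", "3.50 - 4.00"),
        ("Male", "3.00 - 3.49"),
        ("Female", "3.00 - 3.49"),
    ],
)

# Count how many students fall in each CGPA band, per gender.
query = """
SELECT gender, CGPA, COUNT(CGPA) AS n_students
FROM stm
GROUP BY gender, CGPA
ORDER BY gender, CGPA
"""
result = conn.execute(query).fetchall()

for row in result:
    print(row)

conn.close()
```

Swapping `gender` for another column (age, course, or year of study) gives the same breakdown for that parameter.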

# Python/R Analysis
## Introduction

After my SQL analysis, I felt that this data could be easily understood through visualization. I will also work with this data in Tableau, but I thought some exploratory analysis in Python or R would not only show more trends between these students and mental health, but also inspire some visualization ideas.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns

    df = pd.read_csv("Student Mental health.csv")
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 101 entries, 0 to 100
    Data columns (total 11 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   Timestamp     101 non-null    object
     1   gender        101 non-null    object
     2   age           101 non-null    int64
     3   course        101 non-null    object
     4   study_year    101 non-null    object
     5   CGPA          101 non-null    object
     6   marital_stat  101 non-null    object
     7   depression    101 non-null    object
     8   anxiety       101 non-null    object
     9   panic_att     101 non-null    object
     10  specialist    101 non-null    object
    dtypes: int64(1), object(10)
    memory usage: 8.8+ KB


## Python Data Analysis
## Questions to ask:

1. Which [...] has the highest/lowest CGPA?
    - Gender
    - Age
    - Course
    - Year
2. Categories of CGPA (or Gender?) with percentages of Yes/No for...
    - Marital status
    - Depression
    - Anxiety
    - Panic attacks
    - Specialist
3. Compare the two results of each question to find trends.
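
One possible way to get the Yes/No percentages from question 2 is a normalized cross-tabulation. This is only a sketch on made-up rows: the helper name `yes_no_share` is hypothetical, and the CGPA bands shown are placeholders rather than values taken from the real file.

```python
import pandas as pd


def yes_no_share(df, by, target):
    """Percentage of each answer in `target` within every category of `by`."""
    # normalize="index" makes each row (category of `by`) sum to 1.
    return pd.crosstab(df[by], df[target], normalize="index") * 100


# Hypothetical toy data; the real analysis would pass the full df instead.
demo = pd.DataFrame({
    "CGPA": ["3.00 - 3.49", "3.00 - 3.49", "3.50 - 4.00", "3.50 - 4.00"],
    "depression": ["Yes", "No", "No", "No"],
})

pct = yes_no_share(demo, "CGPA", "depression")
print(pct.round(1))
```

The same call could be repeated for `marital_stat`, `anxiety`, `panic_att`, and `specialist`, and with `by="gender"` for the alternative grouping.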

    # Stacked histogram of student ages, split by gender (both are columns of df).
    sns.histplot(data=df, x="age", hue="gender", multiple="stack")