# Pre-Work Installations for the Kaggle API
### An API key must be created first (https://www.kaggle.com/settings).
### The commands below install kaggle, move the API key into the required directory, download the dataset, and unzip it into the data folder.
### *Run only once

In [None]:
!pip install kaggle

In [None]:
#Create API Key here
#https://www.kaggle.com/settings
#Scroll down to API and create a new key; a kaggle.json file should download to the Downloads folder.
#The .kaggle directory must already exist before copying (see the mkdir cell below).

#Windows
!cp "%USERPROFILE%/Downloads/kaggle.json" "%USERPROFILE%/.kaggle/kaggle.json"

#Linux/Mac (leave ~ unquoted so the shell expands it)
# !cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json

In [None]:
#https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
!kaggle datasets download -d "souvikahmed071/social-media-and-mental-health"

In [None]:
#Windows
!mkdir "%USERPROFILE%/.kaggle"

#Linux/Mac
# !mkdir ~/.kaggle
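
The shell commands for the API key differ between Windows and Linux/Mac. As a hedged alternative, the same two steps (create the `.kaggle` directory, then copy the key into it) can be sketched with Python's standard library so that one cell works on any platform. The helper name `install_kaggle_key` and its parameters are illustrative and are not part of the Kaggle API.

```python
import shutil
from pathlib import Path


def install_kaggle_key(source, kaggle_dir=None):
    """Copy a downloaded kaggle.json into the .kaggle directory.

    `source` is the path of the downloaded key file; `kaggle_dir` defaults
    to ~/.kaggle. Both parameter names are assumptions made for this sketch.
    """
    source = Path(source).expanduser()
    kaggle_dir = Path(kaggle_dir).expanduser() if kaggle_dir else Path.home() / ".kaggle"
    # Create the directory first so the copy cannot fail on a missing folder
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    shutil.copy(source, target)
    return target


# Example call (assumes the key was saved to the Downloads folder):
# install_kaggle_key("~/Downloads/kaggle.json")
```

Creating the directory inside the helper removes the dependency on the order in which the setup cells are run.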

In [None]:
#Install Unzip command
!pip install unzip

In [None]:
#Unzip downloaded dataset into newly created data folder
!unzip social-media-and-mental-health.zip -d data/
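
If the `unzip` command is not available on the system, the archive can also be extracted with Python's built-in `zipfile` module. This is a minimal sketch, not the method used in this notebook; the helper name `extract_dataset` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest="data"):
    """Extract every file in zip_path into dest and return the archived names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call (assumes the download cell above has already been run):
# extract_dataset("social-media-and-mental-health.zip", "data")
```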

In [None]:
#Remove extracted files that are not needed
!rm data/Correlation_between_Social_Media_use_and_Mental_Health.ipynb data/README.md

# Begin Here

In [None]:
#importing dependencies 
import hvplot.pandas
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from scipy.stats import linregress
from scipy.stats import pearsonr
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind

In [None]:
# Read the survey file from the data folder created by the unzip step above
main_df = pd.read_csv("data/smmh.csv")
main_df

In [None]:
def relabel_averageTime(row):
    if row['8. What is the average time you spend on social media every day?'] in ['Less than an Hour','Between 1 and 2 hours', 'Between 2 and 3 hours']:
        return '0-3 hours'
    elif row['8. What is the average time you spend on social media every day?'] in ['Between 3 and 4 hours', 'Between 4 and 5 hours']:
        return '3-5 hours'
    elif row['8. What is the average time you spend on social media every day?'] in ['More than 5 hours']:
        return '5+ hours'

#Run the apply method to df for each row calling relabel function
main_df['Average Time on Social Media'] = main_df.apply(lambda row: relabel_averageTime(row), axis=1)
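
The row-wise `apply` above can also be expressed as a dictionary lookup on the column. The sketch below runs on a tiny made-up frame (`demo_df`) rather than `main_df`; the names `TIME_COL` and `TIME_BUCKETS` are introduced here for illustration only.

```python
import pandas as pd

# Column name copied from the survey; the mapping mirrors relabel_averageTime
TIME_COL = '8. What is the average time you spend on social media every day?'
TIME_BUCKETS = {
    'Less than an Hour': '0-3 hours',
    'Between 1 and 2 hours': '0-3 hours',
    'Between 2 and 3 hours': '0-3 hours',
    'Between 3 and 4 hours': '3-5 hours',
    'Between 4 and 5 hours': '3-5 hours',
    'More than 5 hours': '5+ hours',
}

# Made-up answers standing in for main_df
demo_df = pd.DataFrame({TIME_COL: ['Less than an Hour',
                                   'Between 3 and 4 hours',
                                   'More than 5 hours']})

# Series.map looks up each answer in the dictionary
demo_df['Average Time on Social Media'] = demo_df[TIME_COL].map(TIME_BUCKETS)
print(demo_df['Average Time on Social Media'].tolist())
# ['0-3 hours', '3-5 hours', '5+ hours']
```

One difference to keep in mind: answers missing from the dictionary become `NaN` with `map`, whereas the function above returns `None` for them.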




In [None]:
#Keep only respondents who use social media
main_df = main_df.loc[main_df["6. Do you use social media?"]=="Yes", :].copy()


In [None]:
column_list = main_df.columns.tolist()
print(column_list)

In [None]:
main_df

## Age Groups Surveyed

In [None]:
ages_surveyed = main_df.iloc[:, 1].value_counts()
#print(ages_surveyed.head(10))
print(ages_surveyed.tail(10))

In [None]:
# Initial bar chart showing age distribution of those surveyed
plt.bar(ages_surveyed.index.values,ages_surveyed.values)

# Keep the age tick labels horizontal for readability
plt.xticks(rotation=0)

# X and Y axis names
plt.xlabel("Ages of Those Surveyed")
plt.ylabel("Total per Age")
plt.show()


In [None]:
# Custom age ranges
bins = [0, 9, 19, 24, 29, 39, 49, 59, float('inf')]

# Labels for the age groups
labels = ['0-9', '10-19', '20-24', '25-29', '30-39','40-49','50-59', '60-95']  

main_df['Age Groups'] = pd.cut(main_df['1. What is your age?'], bins=bins, labels=labels,include_lowest=True)

# Count the number of individuals in each age group
age_group_counts = main_df['Age Groups'].value_counts()

# List ascending age groups 
age_group_counts=age_group_counts.sort_index()

# Plotting the bar chart
age_group_counts.plot(kind='bar')

# Adding some personality to the chart
plt.xlabel('Age Groups')
plt.ylabel('Count')
plt.title('Age Group Distribution')
plt.xticks(rotation=0)

# Display the chart
plt.show()
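
To make the bin edges used by `pd.cut` concrete, the sketch below applies the same `bins` and `labels` to a few made-up ages that sit on or next to an edge. Because `pd.cut` treats each bin as right-inclusive by default, an age of 19 falls in '10-19' and an age of 20 falls in '20-24'.

```python
import pandas as pd

bins = [0, 9, 19, 24, 29, 39, 49, 59, float('inf')]
labels = ['0-9', '10-19', '20-24', '25-29', '30-39', '40-49', '50-59', '60-95']

# Made-up ages chosen to sit on or next to the bin edges
sample_ages = pd.Series([9, 10, 19, 20, 24, 25, 59, 60])
sample_groups = pd.cut(sample_ages, bins=bins, labels=labels, include_lowest=True)

print(sample_groups.astype(str).tolist())
# ['0-9', '10-19', '10-19', '20-24', '20-24', '25-29', '50-59', '60-95']
```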

## Genders Surveyed

In [None]:
# Catalogue all genders surveyed
genders_surveyed = set(main_df['2. Gender'])
print(genders_surveyed)

In [None]:
# Create an "Others" group so results fall under "Male", "Female", or "Others"
other_labels = ['unsure ', 'There are others???', 'NB', 'Trans', 'Non binary ', 'Nonbinary ', 'Non-binary']

# Restrict the replacement to the gender column so answers in other columns are never touched
main_df['2. Gender'] = main_df['2. Gender'].replace(other_labels, 'Others')

genders_surveyed = set(main_df['2. Gender'])
print(genders_surveyed)
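
The grouping above depends on listing every free-text answer by hand. A hedged alternative, shown on made-up answers, is to strip stray whitespace and then send anything that is not 'Male' or 'Female' to 'Others'. Note that this is a different rule from the one used in this notebook: it would also relabel missing values and any answer not yet seen.

```python
import pandas as pd

# Made-up answers standing in for main_df['2. Gender']
gender = pd.Series(['Male', 'Female', 'NB', 'Non binary ', 'unsure ', 'Female '])

# Remove leading and trailing whitespace, then keep only the two main labels
cleaned = gender.str.strip()
grouped = cleaned.where(cleaned.isin(['Male', 'Female']), 'Others')

print(grouped.tolist())
# ['Male', 'Female', 'Others', 'Others', 'Others', 'Female']
```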

In [None]:
# Counts for each gender category
gender_counts = main_df['2. Gender'].value_counts()
gender_counts

In [None]:
genders_surveyed = main_df['2. Gender'].value_counts()

# Plotting the bar chart
plt.bar(gender_counts.index, gender_counts)
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Survey Gender Distribution')

# Rotating x-axis labels 
plt.xticks(rotation=45)

# Adding percentages to bar chart
total = gender_counts.sum()
for i, count in enumerate(gender_counts):
    percentage = count / total * 100
    plt.text(i, count, f'{percentage:.1f}%', ha='center', va='bottom')

plt.show()

In [None]:
main_df

# Distraction

## Is Social Media a distraction from other important tasks?
- The 'Distraction from Social Media' chart shows that the counts for ratings 4 and 5 are above the 100 mark, whereas the counts for ratings 1 and 2 are below 90. This indicates that a large number of respondents are easily distracted by social media. We factor in age, gender, occupation, and average time spent on social media to determine whether a correlation exists. Ratings in this section run from 1 to 5 and represent how easily a respondent is distracted by social media, with 1 being not at all and 5 being extremely distracted. We also computed the mean distraction rating to use as a reference value when performing statistical tests on the gender categories.

### Do Age, Gender, Average Time Spent on Social Media, and Occupation play a role in how easily one is distracted by social media?
- There are no significant differences between genders in how easily they are distracted by social media. The correlation between the two is very weak (close to 0), which suggests that gender is not a significant factor in how easily people are distracted by social media.
- There is a weak negative correlation between age and distraction rating, which suggests that the likelihood of being distracted decreases as age increases.
- There is a moderate correlation between average time spent on social media and how easily an individual is distracted by it. The 'Distraction Level by Average Time Spent' bar chart shows a trend toward higher ratings as more time is spent on social media.
- A weak correlation exists between occupation and distraction level, which suggests that the level of distraction varies with the nature of the occupation. In this case, university students tend to experience higher levels of distraction.

### Is there a correlation between age and the average time spent on social media based off of how easily they are distracted by social media?
- There is almost no correlation between age and average time spent on social media for distraction, given the correlation coefficient of 0.02. The p-value is also extremely high, which suggests that the correlation is not statistically significant.

### Is there a correlation between occupation and the average time spent on social media based off of how easily they are distracted by social media?
- A weak negative correlation exists between these two factors. With a correlation coefficient of -0.1919, there is a slight tendency for people with different occupations to spend different amounts of time on social media.
- The heatmap 'Distraction Rating by Occupation and Average Time on Social Media' also shows that ratings depend on both the occupation group and the average time spent. For instance, university students show relatively consistent distraction ratings across the different average times spent on social media. This shows that, once average time spent is taken into account, other groups had a higher tendency to be distracted than when we tested occupation against distraction alone. In that earlier case, where average time was not factored in, university students tended to have the highest levels of distraction.

## Plotting the distribution of the population based on rating

In [None]:
# Create a dataframe that represents how often people get distracted by Social Media
distraction_occurrence = main_df['10. How often do you get distracted by Social media when you are busy doing something?']
ratings = ['1', '2', '3', '4', '5']

# Display the information via bar chart
dist_count = distraction_occurrence.value_counts().sort_index()
plt.bar(ratings, dist_count)
plt.title('Distraction from Social Media')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('output_data/distraction/distraction_demographics')
plt.show()

In [None]:
# Find the overall mean distraction rating, used as the reference value in the one-sample t-tests
mean_distraction = main_df['10. How often do you get distracted by Social media when you are busy doing something?'].mean()
mean_distraction

## Age 

- One finding from the statistical testing on age and distraction ratings is a weak negative correlation between the two (-0.2192). The 'Distraction Ratings by Age Group' line chart shows the pattern of each rating as age increases. Age groups 10-19 and 25-29 had approximately similar numbers of respondents, yet their rating distributions were vastly different: people in the older group (25-29) were more likely to be distracted than those in the 10-19 group. As age increases further, the most frequently chosen distraction rating decreases. Age group 30-39 chose rating 2 most often, and age groups 40-49 and 50-59 chose rating 1 most often. This supports the negative correlation: as age increases, the likelihood of being distracted may decrease. However, note that the correlation is weak.
- Notes: 
     - The dataset is dominated by people in their early 20s: 58% of respondents are aged 20-24. Because the data is skewed toward this age group, further testing and analysis is needed to support the findings.

In [None]:
# Create age dataframe: Q10
age_dist_group = main_df.groupby(['Age Groups', '10. How often do you get distracted by Social media when you are busy doing something?'])
age_dist_count = age_dist_group.size().reset_index(name='Rating Count per Age Group')
age_dist_count.head()

In [None]:
# Show info on pivot table for Q10
age_dist_pt = age_dist_count.pivot(index='Age Groups', columns='10. How often do you get distracted by Social media when you are busy doing something?', values='Rating Count per Age Group')

# Calculate Total 
age_dist_pt['Total'] = age_dist_pt.sum(axis=1)
age_dist_pt.loc['Total'] = age_dist_pt.sum(axis=0)
age_dist_pt
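
The group, count, pivot, and total steps above can also be produced in a single call with `pd.crosstab` and its `margins` option. The sketch below uses a tiny made-up frame and a shortened column name ('Rating') in place of the full question text.

```python
import pandas as pd

# Made-up responses standing in for main_df
demo = pd.DataFrame({
    'Age Groups': ['10-19', '10-19', '20-24', '20-24', '20-24'],
    'Rating': [1, 3, 3, 5, 5],
})

# margins=True appends a total row and a total column, named by margins_name
demo_pt = pd.crosstab(demo['Age Groups'], demo['Rating'],
                      margins=True, margins_name='Total')
print(demo_pt)
```

Combinations that never occur are filled with 0 rather than left empty, which keeps the counts as integers.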

In [None]:
# Drop total ratings from Pivot Table before plotting
age_dist_pt = age_dist_pt.drop('Total', axis=0)
age_dist_pt = age_dist_pt.drop('Total', axis=1)

In [None]:
# Plot Age Distribution Ratings for Distraction
age_dist_pt.plot(kind='bar')

# Set axis labels and title
plt.xlabel('Age Group')
plt.ylabel('Response')
plt.title('Distraction Level by Age Group')

# Show the legend
plt.legend(title = 'Rating', loc='upper right', bbox_to_anchor=(1, 1))

# Save output
plt.savefig('output_data/distraction/Distraction_Age_Distribution')

# Show the chart
plt.show()

In [None]:
# Create a chart to show existing trends
# Get the age groups
age_groups = age_dist_pt.index.tolist()

# Set the x-axis values
x_age = range(len(age_groups))

# Plot the trend chart
plt.figure(figsize=(10, 6))
for rating in age_dist_pt.columns:
    plt.plot(x_age, age_dist_pt[rating], marker='o', label=f'Distraction Rating {rating}')

# Set x-axis labels
plt.xticks(x_age, age_groups)

# Set axis labels and title
plt.xlabel('Age Groups')
plt.ylabel('Rating Count')
plt.title('Distraction Ratings by Age Group')

# Show the legend
plt.legend()

# Save output
plt.savefig('output_data/distraction/Distraction_Age_Trend')

# Show the chart
plt.show()

In [None]:
# Find correlation coefficient
distraction_ratings = age_dist_pt.columns
age_groups = ['0-9', '10-19', '20-24', '25-29', '30-39', '40-49', '50-59', '60-95']

# Create empty lists to store the data
age_x_values = []
age_y_values = []

# Extract information from pivot table
for i, age_group in enumerate(age_groups):
    for rating in distraction_ratings:
        count = age_dist_pt.loc[age_group, rating]
        age_x_values.extend([i] * count)  # Assign numerical values to age groups
        age_y_values.extend([float(rating)] * count)

# Calculate the correlation coefficient
correlation_coefficient, p_value = stats.pearsonr(age_x_values, age_y_values)

# Print the correlation coefficient
print(f"Pearson's correlation coefficient: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")
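
Expanding the pivot table back into one (group index, rating) pair per respondent gives the same pairs that the raw rows already contain. The sketch below, on made-up data with a shortened 'Rating' column name, illustrates that correlating the category codes directly yields the same coefficient as the pivot route. It assumes the age groups are stored as an ordered categorical.

```python
import pandas as pd
from scipy import stats

# Made-up responses standing in for main_df
labels = ['10-19', '20-24', '25-29']
demo = pd.DataFrame({
    'Age Groups': pd.Categorical(['10-19', '20-24', '20-24', '25-29', '25-29', '10-19'],
                                 categories=labels, ordered=True),
    'Rating': [5, 4, 3, 2, 1, 4],
})

# Direct route: integer code of each respondent's age group vs. their rating
codes = demo['Age Groups'].cat.codes.astype(float)
r_direct, p_direct = stats.pearsonr(codes, demo['Rating'])

# Pivot route: repeat each (group index, rating) pair by its count
counts = pd.crosstab(demo['Age Groups'], demo['Rating'])
xs, ys = [], []
for i, group in enumerate(counts.index):
    for rating in counts.columns:
        n = int(counts.loc[group, rating])
        xs.extend([i] * n)
        ys.extend([float(rating)] * n)
r_pivot, p_pivot = stats.pearsonr(xs, ys)

print(f"direct: {r_direct:.4f}, pivot: {r_pivot:.4f}")
```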

## Gender 

- The pie charts show that the distribution of ratings does not differ much between gender categories. In particular, the rating makeup of the male and female groups is very similar, with rating 1 being the smallest share in both. The only difference between the two charts is that the largest share for females was rating 5, while for males it was rating 4. The correlation coefficient (-0.0685) shows an extremely weak negative correlation between gender and distraction ratings, and the p-value of 0.1379 indicates that the correlation is not statistically significant. We also ran a statistical test comparing female and male distraction ratings. Its p-value of 0.0690 indicates no significant difference between the two groups, which supports the picture given by the pie charts.

- Notes:
    - Because the "Others" category (1.5%) is far smaller than the Female and Male categories, it was excluded from the comparison. The t-test was therefore performed only on the female and male groups.

In [None]:
# Create dataframe for gender - Q10 
gender_dist_group = main_df.groupby(['2. Gender', '10. How often do you get distracted by Social media when you are busy doing something?'])
gender_dist_count = gender_dist_group.size().reset_index(name='Rating Count per Gender')
gender_dist_count.head()

In [None]:
# Show info on pivot table for Q10
gender_dist_pt = gender_dist_count.pivot(index='2. Gender', columns='10. How often do you get distracted by Social media when you are busy doing something?', values='Rating Count per Gender')

# Calculate Total 
gender_dist_pt['Total'] = gender_dist_pt.sum(axis=1)
gender_dist_pt.loc['Total'] = gender_dist_pt.sum(axis=0)
gender_dist_pt

In [None]:
# Drop total ratings from Pivot Table before plotting
gender_dist_pt = gender_dist_pt.drop('Total', axis=0)
gender_dist_pt = gender_dist_pt.drop('Total', axis=1)

In [None]:
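# Get how many gender categories there are
gender_count = len(gender_dist_pt)

# Draw one pie chart per gender on a shared figure so that every chart is saved
fig, axes = plt.subplots(1, gender_count, figsize=(5 * gender_count, 5))
for ax, gender in zip(np.atleast_1d(axes), gender_dist_pt.index):
    data = gender_dist_pt.loc[gender]
    ax.pie(data, labels=data.index, autopct='%1.1f%%', startangle=90)
    ax.set_title(gender)

# Apply the layout before saving so the saved image matches what is shown
plt.tight_layout()
plt.savefig('output_data/distraction/Distraction_Gender_Distribution')
plt.show()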

In [None]:
# Find correlation coefficient between ratings and gender differences
distraction_ratings = gender_dist_pt.columns
genders = gender_dist_pt.index

# Create empty lists to store the data
gender_x_values = []
gender_y_values = []

# Extract information from pivot table
for i, gender in enumerate(genders):
    for rating in distraction_ratings:
        count = gender_dist_pt.loc[gender, rating]
        gender_x_values.extend([i] * count)  # Assign numerical values to genders
        gender_y_values.extend([float(rating)] * count)

# Calculate the correlation coefficient
correlation_coefficient, p_value = stats.pearsonr(gender_x_values, gender_y_values)

# Print the correlation coefficient
print(f"Pearson's correlation coefficient: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")

In [None]:
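# Perform a one-sample t-test for each gender category
# Note: this compares the rating counts in the pivot table (not individual responses) with the mean rating
for gender in gender_dist_pt.index:
    ratings = gender_dist_pt.loc[gender].values[1:]  # Counts for ratings 2-5 (the first value, rating 1, is skipped)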
    t_statistic, p_value = stats.ttest_1samp(ratings, mean_distraction)
    print(f"One-sample t-test for {gender}:")
    print(f"T-statistic: {t_statistic:.4f}")
    print(f"P-value: {p_value:.4f}")
    print()

In [None]:
# Extract the rating counts for Female and Male from the pivot table
# Note: the slice [1:6] is positional, so the first column (rating 1) is skipped
ratings_female = gender_dist_pt.loc['Female'][1:6]
ratings_male = gender_dist_pt.loc['Male'][1:6]

# Run the two-sample t-test
t_statistic, p_value = stats.ttest_ind(ratings_female, ratings_male, equal_var=False)

print("Two-sample t-test results:")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
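
The t-test cells above operate on the rating counts taken from the pivot table. A hedged alternative, sketched here on made-up data, is to pass each respondent's individual rating to Welch's t-test so that the two samples are the responses themselves. The helper name `welch_by_group` and the shortened column names are illustrative, and the figures reported in this notebook come from the cells above, not from this sketch.

```python
import pandas as pd
from scipy import stats

# Made-up responses standing in for main_df
demo = pd.DataFrame({
    'Gender': ['Female'] * 5 + ['Male'] * 5,
    'Rating': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
})


def welch_by_group(df, group_col, value_col, group_a, group_b):
    """Welch's two-sample t-test on the raw values of two groups."""
    a = df.loc[df[group_col] == group_a, value_col]
    b = df.loc[df[group_col] == group_b, value_col]
    return stats.ttest_ind(a, b, equal_var=False)


t_stat, p_val = welch_by_group(demo, 'Gender', 'Rating', 'Female', 'Male')
print(f"T-statistic: {t_stat:.4f}, P-value: {p_val:.4f}")
```

In this made-up example both groups gave identical ratings, so the test reports no difference between them.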

## Average Time Spent on Social Media

- The distribution across the three categories of average time spent on social media (0-3 hours, 3-5 hours, and 5+ hours) is more balanced than for the previous factors studied. The correlation coefficient of 0.29 suggests a moderate positive correlation between average time spent and how easily people are distracted by social media; that is, people who spend more time on social media tend to report higher levels of distraction. This is supported by the 'Distraction Level by Average Time Spent' chart, where people who spent 5+ hours on social media most often rated themselves extremely distracted (5), whereas people who spent 0-3 hours were spread fairly evenly across all ratings. From one time category to the next, ratings trend higher as more time is spent on social media.
- Various factors could play a role in the relationship between how much time one spends on social media and how easily one is distracted by it, including environmental and individual differences. More research is therefore needed for a comprehensive understanding of the relationship between average time spent on social media and the level of distraction it causes.

In [None]:
# Create dataframe for time - Q10 
time_dist_group = main_df.groupby(['Average Time on Social Media', '10. How often do you get distracted by Social media when you are busy doing something?'])
time_dist_count = time_dist_group.size().reset_index(name='Rating Count per Avg Time')
time_dist_count.head()

In [None]:
# Show info on pivot table for Q10
time_dist_pt = time_dist_count.pivot(index='Average Time on Social Media', columns='10. How often do you get distracted by Social media when you are busy doing something?', values='Rating Count per Avg Time')

# Calculate Total 
time_dist_pt['Total'] = time_dist_pt.sum(axis=1)
time_dist_pt.loc['Total'] = time_dist_pt.sum(axis=0)
time_dist_pt

In [None]:
# Drop total ratings from Pivot Table before plotting
time_dist_pt = time_dist_pt.drop('Total', axis=0)
time_dist_pt = time_dist_pt.drop('Total', axis=1)

In [None]:
# Plot Time Distribution Ratings for Distraction
time_dist_pt.plot(kind='bar')

# Set axis labels and title
plt.xlabel('Avg Time')
plt.ylabel('Response')
plt.title('Distraction Level by Average Time Spent')

# Plot the legend
plt.legend(title = 'Rating', loc='upper left', bbox_to_anchor=(1, 1))

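# Save the figure (bbox_inches='tight' keeps the legend, which sits outside the axes, in the image)
plt.savefig('output_data/distraction/Distraction_Average_Time_Spent', bbox_inches='tight')

# Show the chart
plt.show()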


In [None]:
# Find correlation coefficient between average time on social media and distraction rating
distraction_ratings = time_dist_pt.columns
average_times = time_dist_pt.index

# Create empty lists to store the data
time_x_values = []
time_y_values = []

# Extract information from pivot table
for i, time_category in enumerate(average_times):
    for rating in distraction_ratings:
        count = time_dist_pt.loc[time_category, rating]
        time_x_values.extend([i] * count)  # Assign numerical values to average time categories
        time_y_values.extend([float(rating)] * count)

# Calculate the correlation coefficient
correlation_coefficient, p_value = stats.pearsonr(time_x_values, time_y_values)

# Print the correlation coefficient
print(f"Pearson's correlation coefficient: {correlation_coefficient:.4f}")
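
Both the time categories and the ratings are ordered categories rather than measurements, so a rank-based coefficient is a common alternative to Pearson's. The sketch below swaps in Spearman's rank correlation (`scipy.stats.spearmanr`) on made-up data; it is not the method used for the figures in this notebook, and the shortened 'Rating' column name is an assumption.

```python
import pandas as pd
from scipy import stats

# Explicit category order, so the codes do not depend on alphabetical sorting
time_order = ['0-3 hours', '3-5 hours', '5+ hours']

# Made-up responses standing in for main_df
demo = pd.DataFrame({
    'Average Time on Social Media': ['0-3 hours', '0-3 hours', '3-5 hours',
                                     '3-5 hours', '5+ hours', '5+ hours'],
    'Rating': [1, 2, 3, 3, 4, 5],
})

# Convert each time category to its position in time_order (0, 1, 2)
time_codes = pd.Categorical(demo['Average Time on Social Media'],
                            categories=time_order, ordered=True).codes

rho, p_value = stats.spearmanr(time_codes, demo['Rating'])
print(f"Spearman's rho: {rho:.4f}")
print(f"P-value: {p_value:.4f}")
```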

## Occupation

- There is a weak positive correlation between occupation and distraction (0.2233). The 'Distraction Level by Occupation' chart shows that university students are more prone to higher levels of distraction than salaried workers, school students, and retired people. For salaried workers, unlike students, ratings of 4 and 5 were less common than the other three ratings. This suggests that, because of their work environment, time availability, and work or stress levels, salaried workers may have job responsibilities that leave them less time for social media.
- Because university students are overrepresented in the data, the statistical findings could be skewed. Additional testing is therefore needed to understand the role of occupation. For instance, salaried workers could be separated into categories by the type of work they do, which would allow further analysis of how occupation affects distraction by social media. However, the dataset does not contain that information.

In [None]:
# Create dataframe for occupation - Q10 
occupation_dist_group = main_df.groupby(['4. Occupation Status', '10. How often do you get distracted by Social media when you are busy doing something?'])
occupation_dist_count = occupation_dist_group.size().reset_index(name='Rating Count per Occupation')
occupation_dist_count.head()

In [None]:
# Show info on pivot table for Q10
occupation_dist_pt = occupation_dist_count.pivot(index='4. Occupation Status', columns='10. How often do you get distracted by Social media when you are busy doing something?', values='Rating Count per Occupation')

# Calculate Total 
occupation_dist_pt['Total'] = occupation_dist_pt.sum(axis=1)
occupation_dist_pt.loc['Total'] = occupation_dist_pt.sum(axis=0)
occupation_dist_pt

In [None]:
# Drop total ratings from Pivot Table before plotting
occupation_dist_pt = occupation_dist_pt.drop('Total', axis=0)
occupation_dist_pt = occupation_dist_pt.drop('Total', axis=1)

In [None]:
# Plot Distraction Rating by Occupation
occupation_dist_pt.plot(kind='bar')

# Set axis labels and title
plt.xlabel('Occupation')
plt.ylabel('Response')
plt.title('Distraction Level by Occupation')

# Plot legend
plt.legend(title= 'Rating', loc='upper left', bbox_to_anchor=(1, 1))

# Save figure
plt.savefig('output_data/distraction/distraction_by_occupation')

# Show plot
plt.show()

In [None]:
# Find correlation coefficient between occupation and distraction rating
distraction_ratings = occupation_dist_pt.columns
occupations = occupation_dist_pt.index

# Create empty lists to store the data
occupation_x_values = []
occupation_y_values = []

# Extract information from pivot table
for i, occupation in enumerate(occupations):
    for rating in distraction_ratings:
        count = occupation_dist_pt.loc[occupation, rating]
        occupation_x_values.extend([i] * count)  # Assign numerical values to occupations
        occupation_y_values.extend([float(rating)] * count)

# Calculate the correlation coefficient
correlation_coefficient, p_value = stats.pearsonr(occupation_x_values, occupation_y_values)

# Print the correlation coefficient
print(f"Pearson's correlation coefficient: {correlation_coefficient:.4f}")

## Age Groups and Average Times on Social Media

- We can assume that there is almost no correlation between age and average time spent on social media for distraction, given the p-value of 0.8406 and the correlation coefficient of 0.02.

- Based on the heatmap 'Distraction Rating by Age Group and Average Time on Social Media', people aged 40-49 and 60-95 have the highest average distraction ratings when they spend 5+ hours on social media. In comparison, the youngest age group (10-19) shows lower average distraction ratings across all time categories. Because age group 50-59 has missing values (NaN) in the 5+ hours category, there is not enough information to compare it with the other age groups.

In [None]:
main_df_copy = main_df.copy()
main_df_copy.loc[:, 'Average Time on Social Media'] = main_df_copy['Average Time on Social Media'].astype(str)
age_avg_time_dist = main_df_copy.groupby(['Age Groups', 'Average Time on Social Media'])['10. How often do you get distracted by Social media when you are busy doing something?'].mean().reset_index()
age_avg_time_dist

In [None]:
# Show info on pivot table for Q10
age_time_pt = main_df_copy.pivot_table(index='Age Groups', columns='Average Time on Social Media', values='10. How often do you get distracted by Social media when you are busy doing something?', aggfunc='mean')
age_time_pt = age_time_pt.sort_index()

# Calculate Total Mean
age_time_pt['Total Average'] = age_time_pt.mean(axis=1)
age_time_pt.loc['Total Average'] = age_time_pt.mean(axis=0)
age_time_pt

In [None]:
# Drop total ratings from Pivot Table before plotting
age_time_pt = age_time_pt.drop('Total Average', axis=0)
age_time_pt = age_time_pt.drop('Total Average', axis=1)

In [None]:
# Sort by age order (labels must match those used when creating 'Age Groups')
age_order = ['60-95','50-59', '40-49', '30-39', '25-29','20-24', '10-19', '0-9']
age_time_pt = age_time_pt.reindex(age_order)

# Get the tick labels for x-axis and y-axis
x_ticks = age_time_pt.columns
y_ticks = age_time_pt.index

# Create the heatmap
heatmap= age_time_pt.values

plt.figure(figsize=(10, 6))
plt.imshow(heatmap, cmap='cool')
cbar = plt.colorbar()
cbar.set_label('Level of Distraction')

# Set ticks
plt.xticks(np.arange(len(x_ticks)), x_ticks)
plt.yticks(np.arange(len(y_ticks)), y_ticks)

plt.xticks(rotation=90)

# Set axis labels and title
plt.xlabel('Average Time on Social Media')
plt.ylabel('Age Group')
plt.title('Distraction Rating by Age Group and Average Time on Social Media')

#Save figure
plt.savefig('output_data/distraction/Age_Time_Distraction_Heatmap')

plt.show()

In [None]:
# Extract information from pivot table
age_groups = age_time_pt.index.tolist()
average_times = age_time_pt.columns.tolist()

age_x_values = []
age_y_values = []

for i, age_group in enumerate(age_groups):
    for j, average_time in enumerate(average_times):
        rating = age_time_pt.loc[age_group, average_time]
        if pd.notnull(rating):  # Check for NaN values
            rating_int = int(rating)  # Convert the rating to an integer
            age_x_values.extend([i] * rating_int)  # Assign numerical values to age groups
            age_y_values.extend([j] * rating_int)  # Assign numerical values to average time

# Calculate the correlation coefficient
correlation_coefficient, p_value = stats.pearsonr(age_x_values, age_y_values)

# Print the correlation coefficient
print(f"Pearson's correlation coefficient between Average Time on Social Media and Age Groups: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")

## Occupation and Average Time on Social Media

- There is a weak negative correlation between occupation and average time spent on social media based on distraction ratings. With a correlation coefficient of -0.1919, there is a slight tendency for people with different occupations to spend different amounts of time on social media. Because the p-value is 0.2623, we fail to reject the null hypothesis that there is no correlation between average time on social media and occupation. Further testing is needed to determine how different occupations affect the average time spent on social media and how distracted people are as a result of these two factors.
- The heatmap 'Distraction Rating by Occupation and Average Time on Social Media' shows that retired workers tend to spend more time on social media, which suggests they are more likely to be distracted by it. However, because retired workers make up only 1.7% of respondents, there is too little data to rely on this result. School students and salaried workers, on the other hand, have higher distraction ratings as they spend more time on social media. In comparison, university students show relatively consistent distraction ratings across the different average times spent on social media.

In [None]:
# Create data frame
occ_avg_time_dist = main_df_copy.groupby(['4. Occupation Status', 'Average Time on Social Media'])['10. How often do you get distracted by Social media when you are busy doing something?'].mean().reset_index()
occ_avg_time_dist

In [None]:
# Show info on pivot table for Q10
occ_time_pt = main_df.pivot_table(index='4. Occupation Status', columns='Average Time on Social Media', values='10. How often do you get distracted by Social media when you are busy doing something?', aggfunc='mean')
occ_time_pt = occ_time_pt.sort_index()

# Calculate Total Mean
occ_time_pt['Total Average'] = occ_time_pt.mean(axis=1)
occ_time_pt.loc['Total Average'] = occ_time_pt.mean(axis=0)
occ_time_pt

In [None]:
# Drop total ratings from Pivot Table before plotting
occ_time_pt = occ_time_pt.drop('Total Average', axis=0)
occ_time_pt = occ_time_pt.drop('Total Average', axis=1)

In [None]:
# Sort the index
occ_time_pt = occ_time_pt.sort_index()

# Get the tick labels for x-axis and y-axis
x_ticks = occ_time_pt.columns
y_ticks = occ_time_pt.index

# Create the heatmap
plt.figure(figsize=(10, 6))
plt.imshow(occ_time_pt, cmap='cool', aspect='auto')
cbar = plt.colorbar()
cbar.set_label('Distraction Rating')

# Set ticks
plt.xticks(range(len(x_ticks)), x_ticks, rotation=90)
plt.yticks(range(len(y_ticks)), y_ticks)

# Set axis labels and title
plt.xlabel('Average Time on Social Media')
plt.ylabel('Occupation')
plt.title('Distraction Rating by Occupation and Average Time on Social Media')

# Save fig
plt.savefig('output_data/distraction/occupation_time_distraction')

plt.show()

In [None]:
# Extract information from pivot table
occupations = occ_time_pt.index.tolist()
average_times = occ_time_pt.columns.tolist()

occupation_x_values = []
occupation_y_values = []

for i, occupation in enumerate(occupations):
    for j, average_time in enumerate(average_times):
        rating = occ_time_pt.loc[occupation, average_time]
        if pd.notnull(rating):  # Check for NaN values
            rating_int = int(rating)  # Convert the rating to an integer
            occupation_x_values.extend([i] * rating_int)  # Assign numerical values to occupations
            occupation_y_values.extend([j] * rating_int)  # Assign numerical values to average time

# Calculate the correlation coefficient
correlation_coefficient, p_value = stats.pearsonr(occupation_x_values, occupation_y_values)

# Print the correlation coefficient and p-value
print(f"Pearson's correlation coefficient between Average Time on Social Media and Occupation: {correlation_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")
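
Occupation has no natural order, so a Pearson coefficient computed on its index positions depends on how the categories happen to be sorted. A hedged alternative for two categorical variables is a chi-square test of independence on their contingency table, sketched below on made-up data with shortened column names. This swaps in a different technique from the one used in this notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up responses standing in for main_df
demo = pd.DataFrame({
    'Occupation': ['Student'] * 6 + ['Worker'] * 6,
    'Avg Time': ['0-3 hours', '0-3 hours', '3-5 hours',
                 '3-5 hours', '5+ hours', '5+ hours'] * 2,
})

# Count respondents for every occupation / time combination
table = pd.crosstab(demo['Occupation'], demo['Avg Time'])

# Test whether the two variables are independent
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"Chi-square: {chi2:.4f}, P-value: {p_value:.4f}, dof: {dof}")
```

In this made-up table both occupations have identical time distributions, so the test finds no evidence of an association.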

# Collection of App Usage vs Mental Health

## Does a specific SM platform or collection of platforms lead to more issues than others? 

- The most common social media platforms among the survey respondents were Facebook, Instagram, and YouTube.
- Facebook and Instagram can both be associated with social comparison, which may lead to negative emotions.
- We do not have specific data on how a distinct platform makes the respondent feel. Reducing the data to respondents who reported only one of the top three platforms did not leave enough information for a conclusion.
- Users in this sample tend to use more than one platform, usually including the three most popular ones.
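
To illustrate why restricting to a single platform leaves so little data, the sketch below counts respondents who list exactly one platform. It uses a small made-up DataFrame; the column name `platforms` and the `", "` separator are assumptions for illustration only and may not match the survey file.

```python
import pandas as pd

# Hypothetical responses; the real survey column name and separator may differ.
toy_df = pd.DataFrame({
    "platforms": [
        "Facebook",
        "Facebook, Instagram, YouTube",
        "YouTube",
        "Instagram, YouTube",
        "Facebook",
    ]
})

# Split each response into a list of platform names.
platform_lists = toy_df["platforms"].str.split(", ")

# Keep only respondents who reported exactly one platform.
single_platform = toy_df[platform_lists.str.len() == 1]

# Sample size per platform among single-platform respondents.
single_counts = single_platform["platforms"].value_counts()
print(single_counts)
```

Checking these counts before plotting shows how many respondents each single-platform comparison actually rests on.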

## Does using a collection of platforms have more of an impact on mental health?
- Respondents in the dataset used 3-5 social media platforms on average; more specifically, the mean was about 4 platforms.
- The average Total Score rose by 12% from the 1-2 platform group to the 3-5 platform group, which is a minor increase. This may suggest that engaging with multiple platforms has more of an impact on attention throughout the day.
- The 1-2 platform group also spent, on average, about an hour less on social media than the 3-5 and 6+ platform groups.
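
The percentage increase between groups can be checked with a one-line calculation. The sketch below uses the rounded group means reported in this notebook for the three groups (31, 35, and 37); because the inputs are rounded, the result only approximates a figure computed from unrounded means. The helper name `percent_change` is introduced here for illustration.

```python
def percent_change(baseline, value):
    """Relative change from baseline to value, in percent."""
    return (value - baseline) / baseline * 100

# Rounded average Total Score per platform group (rounded values, so results are approximate).
group_means = {"1-2": 31, "3-5": 35, "6+": 37}

increase_1_2_to_3_5 = percent_change(group_means["1-2"], group_means["3-5"])
increase_3_5_to_6_plus = percent_change(group_means["3-5"], group_means["6+"])

print(f"1-2 -> 3-5 platforms: {increase_1_2_to_3_5:.1f}% increase")
print(f"3-5 -> 6+ platforms: {increase_3_5_to_6_plus:.1f}% increase")
```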

In [None]:
#Remove the Timestamp column; it is not needed.
#Copy so that columns added later do not trigger SettingWithCopyWarning.
socialApps_df = main_df.iloc[:, 1:].copy()
socialApps_df.head()

In [None]:
#Get the platforms column (one delimited string per respondent)
appsList = socialApps_df.iloc[:, 6]

#Split each response on ";" into a list of platform names
listOfApps = [app.split(";") for app in appsList]

In [None]:
#Add Number of Apps to DF

listOfNumberApps = [len(x) for x in listOfApps]

#Add to social apps df
socialApps_df['Number of Apps'] = listOfNumberApps

In [None]:
#Rename columns to respective type of question.
socialApps_df.rename(columns = {'9. How often do you find yourself using Social media without a specific purpose?':'ADHD Q1',
                       '10. How often do you get distracted by Social media when you are busy doing something?':'ADHD Q2',
                       "11. Do you feel restless if you haven't used Social media in a while?":'Anxiety Q1',
                       '12. On a scale of 1 to 5, how easily distracted are you?':'ADHD Q3',
                       '13. On a scale of 1 to 5, how much are you bothered by worries?':'Anxiety Q2',
                       '14. Do you find it difficult to concentrate on things?':'ADHD Q4',
                       '15. On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media?':'Self Esteem Q1',
                       '17. How often do you look to seek validation from features of social media?':'Self Esteem Q2',
                       '18. How often do you feel depressed or down?':'Depression Q1',
                       '19. On a scale of 1 to 5, how frequently does your interest in daily activities fluctuate?':'Depression Q2',
                       '20. On a scale of 1 to 5, how often do you face issues regarding sleep?':'Depression Q3' },inplace=True)


In [None]:
# Custom app ranges
appBins = [1, 2, 5, 20]

# Labels for the app groups
appLabels = ['1-2', '3-5', '6+']  

#Bin the new groups
socialApps_df['App Groups'] = pd.cut(socialApps_df['Number of Apps'], bins=appBins, labels=appLabels,include_lowest=True)

In [None]:
socialApps_df.head()

In [None]:
#Create a chart to detail time spent on social media vs amount of platforms
statSummary = socialApps_df.groupby('Average Time on Social Media')

averageNumberOfApps = statSummary['Number of Apps'].mean()

plt.bar(averageNumberOfApps.index, averageNumberOfApps, edgecolor='black')
plt.xlabel("Average Time Spent on Social Media")
plt.ylabel("Average Amount of Apps")
plt.title('Average Time Spent vs Average Amount of Platforms')
plt.yticks(np.arange(0,5,step=0.5))
plt.show()
averageNumberOfApps

In [None]:
from collections import Counter

#Flatten List of lists ex: ([["A"], ["B"], ["C"]] = ["A", "B", "C"])
appsTotalList = [item for sublist in listOfApps for item in sublist]

#Count total amount of recorded platforms
recordedAppsTotal = Counter(appsTotalList)

#Create into DF
recordedAppsTotal_df = pd.DataFrame.from_dict(recordedAppsTotal, orient='index', columns=["Total"])

In [None]:
#Plot the recorded results of platforms
plt.bar(recordedAppsTotal_df.index, recordedAppsTotal_df['Total'], width=0.6, align='center', color='blue', edgecolor='black')
plt.xticks(rotation=45)
plt.title('Total Amount for Platforms Used')
plt.ylabel('Total Amount Reported')
plt.xlabel('Social Media Platforms')
plt.show()
recordedAppsTotal_df

In [None]:
#Create a column for ADHD Total questions, 4 Question total (20 points Max)
socialApps_df['ADHD Total Score'] = socialApps_df['ADHD Q1'] + socialApps_df['ADHD Q2'] + socialApps_df['ADHD Q3'] + socialApps_df['ADHD Q4']

#Create a column for Self Esteem Total questions, 2 Question total (10 points Max)
socialApps_df['Self Esteem Total Score'] = socialApps_df['Self Esteem Q1'] + socialApps_df['Self Esteem Q2']

#Create a column for Anxiety Total questions, 2 Question total (10 points Max)
socialApps_df['Anxiety Total Score'] = socialApps_df['Anxiety Q1'] + socialApps_df['Anxiety Q2']

#Create a column for Depression total questions, 3 Question total (15 points Max)
socialApps_df['Depression Total Score'] = socialApps_df['Depression Q1'] + socialApps_df['Depression Q2'] + socialApps_df['Depression Q3']

#Create a column for Total Amount of questions, 11 Question total (55 points Max)
socialApps_df['Total Score'] = socialApps_df['ADHD Total Score'] + socialApps_df['Self Esteem Total Score'] + socialApps_df['Anxiety Total Score'] + socialApps_df['Depression Total Score']


In [None]:
#Filter for specific platform of the top 3, Facebook
filtered_list = []
for index, row in socialApps_df.iterrows():
    temp = row['7. What social media platforms do you commonly use?'].split()
    if 'Facebook' in temp and len(temp) == 1:
        filtered_list.append(row)
filtered_df = pd.DataFrame(filtered_list)
filtered_df.head()

In [None]:
#Filter for specific platform of the top 3, Instagram
filtered_list_in = []
for index, row in socialApps_df.iterrows():
    temp = row['7. What social media platforms do you commonly use?'].split()
    if 'Instagram' in temp and len(temp) == 1:
        filtered_list_in.append(row)
filteredIn_df = pd.DataFrame(filtered_list_in)
filteredIn_df.head()

In [None]:
#Filter for specific platform of the top 3, YouTube
filtered_list_yt = []
for index, row in socialApps_df.iterrows():
    temp = row['7. What social media platforms do you commonly use?'].split()
    if 'YouTube' in temp and len(temp) == 1:
        filtered_list_yt.append(row)
filteredYT_df = pd.DataFrame(filtered_list_yt)
filteredYT_df.head()

In [None]:
#Plot the Total Score distribution for single-platform users of each top-3 platform.
#sharey=True puts all three boxes on the same scale so they are directly comparable.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True)

ax1.boxplot(filtered_df['Total Score'])
ax2.boxplot(filteredIn_df['Total Score'])
ax3.boxplot(filteredYT_df['Total Score'])
ax1.set_ylabel('Total Score from Recipients')
ax2.set_xlabel('Platform')
ax2.set_title('Platform vs Total Score')
ax1.set_xticklabels(['Facebook'])
ax2.set_xticklabels(['Instagram'])
ax3.set_xticklabels(['YouTube'])
ax1.set_yticks(np.arange(5, 60, step=5))
plt.show()

In [None]:
#Show columns in social apps
socialApps_df.columns

In [None]:
#Aggregate Averages for Tendency by Category and Total
totalScoreNumberOfApps = socialApps_df.groupby('App Groups').agg({'ADHD Total Score': 'mean', 'Self Esteem Total Score': 'mean',
       'Anxiety Total Score': 'mean', 'Depression Total Score': 'mean', 'Total Score': 'mean'})

X_axis = np.arange(len(totalScoreNumberOfApps.index))

#Plot the average question type scores amongst the groups
plt.bar(X_axis - 0.34, totalScoreNumberOfApps['Self Esteem Total Score'], width=0.2, edgecolor='black', zorder=3)
plt.bar(X_axis - 0.11, totalScoreNumberOfApps['Anxiety Total Score'], width=0.2, edgecolor='black', zorder=3)
plt.bar(X_axis + 0.11, totalScoreNumberOfApps['Depression Total Score'], width=0.2, edgecolor='black', zorder=3)
plt.bar(X_axis + 0.34, totalScoreNumberOfApps['ADHD Total Score'], width=0.2, edgecolor='black', zorder=3)
plt.xticks(X_axis, totalScoreNumberOfApps.index)
plt.yticks(np.arange(0, 20 , step=2))
plt.grid(axis='y', color='gray', linewidth=0.4, zorder=0)
plt.legend(['Avg Self Esteem Total Score (2 Q)', 'Avg Anxiety Total Score (2 Q)', 'Avg Depression Total Score (3 Q)','Avg ADHD Total Score (4 Q)'], bbox_to_anchor=(1, 1))
plt.ylabel('Average Scores *Higher being worse')
plt.xlabel('Collection of Platform Groups')
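#Annotate each group with its rounded average Total Score instead of hard-coding the values
for i, total in enumerate(totalScoreNumberOfApps['Total Score']):
    plt.annotate(f"{total:.0f}", (i, 14))
plt.title('Average Tendency Score vs Number of Platforms')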
plt.savefig('output_data/averageTendencyScorePlatforms')
plt.show()
totalScoreNumberOfApps

In [None]:
#Plot the distribution of Platforms used
variationNumberOfApps = socialApps_df.groupby('Number of Apps')

test3 = variationNumberOfApps.count()
plt.bar(test3.index, test3['2. Gender'], align='center', edgecolor='black')
plt.ylabel('Total Number of Recipients')
plt.xlabel('Number of Platforms')
plt.title('Distribution of Number of Platforms vs Number of Recipients')
plt.xticks(test3.index)
plt.savefig('output_data/distributionNumberOfPlatforms')
plt.show()

In [None]:
#Scatter plot number of apps vs total frequency score
plt.scatter(socialApps_df['Number of Apps'], socialApps_df['Total Score'] , marker='o', alpha=0.6, edgecolors='black', s=60)

#Perform linear regression
slope, intercept, r, p, stderr = linregress(socialApps_df['Number of Apps'], socialApps_df['Total Score'])

#y=mx+b
line = slope * socialApps_df['Number of Apps'] + intercept

#Create plot
plt.plot(socialApps_df['Number of Apps'], line, 'r')
plt.annotate(f"y={slope:0.02f}x + {intercept:0.02f}", (6, 10), color='r')
plt.ylabel('Total Score from Recipients')
plt.xlabel('Number of Platforms')
plt.title('Total Frequency Score vs Number of Platforms')
plt.yticks(np.arange(5, 65, step=5))
plt.savefig('output_data/totalFrequencyVsNumberPlatforms')
plt.show()
#Weak correlation
print(f"Pearson Correlation Coefficient: {r:0.02f}, Weak Correlation")

In [None]:
# Use group by and size function to perform the calculation
age_and_sleep_deprivation = main_df.groupby(["Age Groups","20. On a scale of 1 to 5, how often do you face issues regarding sleep?"])
age_and_sleep_deprivation_count = age_and_sleep_deprivation.size().reset_index(name='Rating Count per Age Group')
# Create a pivot table that contains count of the responses,scale wise.
age_and_sleep_dep = age_and_sleep_deprivation_count.pivot(index='Age Groups', columns='20. On a scale of 1 to 5, how often do you face issues regarding sleep?',values= 'Rating Count per Age Group')
age_and_sleep_dep['Total'] = age_and_sleep_dep.sum(axis=1)
age_and_sleep_dep.loc['Total'] = age_and_sleep_dep.sum(axis=0)
age_and_sleep_dep

In [None]:
# Drop the extra columns and rows that are not needed.
age_and_sleep_dep = age_and_sleep_dep.drop('Total', axis=0)
age_and_sleep_dep = age_and_sleep_dep.drop('Total', axis=1)

# Plot info for people with sleep deprivation by age group and display in a bar chart.
age_and_sleep_dep.plot(kind='bar',figsize=(13,8))
plt.xlabel('Age Group')
plt.ylabel('Response')
plt.title('People with sleep deprivation by Age Group')
plt.legend(title = 'Rating', loc='upper right', bbox_to_anchor=(1, 1))
plt.savefig("output_data/sleepDeprivationAndAgegroup.png")
plt.show()

In [None]:
# Create the table with gender: use groupby and aggregate the count of each gender per rating.
gender_sleep_deprivation = main_df.groupby(["2. Gender","20. On a scale of 1 to 5, how often do you face issues regarding sleep?"]).agg({'2. Gender': ['count']})
gender_sleep_deprivation.reset_index()

In [None]:
# Create bin to group the scales. 
scale_binning = [0,2.5,3.5,6]
scale_range =["1-2", "3", "4-5"]
scale_group_df = main_df.copy()
scale_group_df["Scale Groups"]=pd.cut(scale_group_df["20. On a scale of 1 to 5, how often do you face issues regarding sleep?"],scale_binning, labels = scale_range, include_lowest=False)

In [None]:
# Create a table with gender, scale group, and the count of responses that lie in each bin.
gender_sleep_deprivation = scale_group_df.groupby(["2. Gender","Scale Groups"]).agg({'2. Gender': ['count']}).reset_index()
gender_sleep_deprivation.columns = ['Gender', 'Scale Groups', 'Count']
gender_sleep_deprivation

In [None]:
# Filter out the male,female, and others from the gender_sleep_deprivation table and plot the different pie chart to show the trend on sleep issue. 
# Separate the female data
female_data =gender_sleep_deprivation[gender_sleep_deprivation["Gender"]=="Female"]
# Separate the male data
male_data =gender_sleep_deprivation[gender_sleep_deprivation["Gender"]=="Male"]
# Separate the other data
other_data =gender_sleep_deprivation[gender_sleep_deprivation["Gender"]=="Others"]
#Use subplot to show different chart for each of the filtered out data.
plt.subplot(1, 3, 1)
plt.suptitle("Sleep deprivation by gender", fontsize=16)
_ = plt.pie(male_data["Count"], labels=male_data["Scale Groups"], autopct="%1.1f%%")
plt.title("Male")

plt.subplot(1, 3, 2)
_ = plt.pie(female_data["Count"], labels=female_data["Scale Groups"], autopct="%1.1f%%")
_ = plt.title("Female")

plt.subplot(1,3,3)
_ = plt.pie(other_data["Count"], labels=other_data["Scale Groups"], autopct="%1.1f%%")
_ = plt.title("Others")
plt.tight_layout(pad=1.0)
plt.savefig("output_data/sleepDeprivationAndGender.png")


## Sleep Deprivation by Gender
From the pie charts above, we can conclude that sleep deprivation ratings do not differ much between genders. The ratings for the male and female groups are almost the same. I also plotted the chart for the Others group, even though it contains very few respondents, and it shows a similar trend. There is no gender-specific pattern: male, female, and other respondents appear equally affected by social media in terms of sleep. All of them report sleep deprivation, so we can say that social media is an issue that affects the mental health of all genders.
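
One way to check the "no big difference" reading more formally is a chi-square test of independence on the gender by rating-group counts. The sketch below is not part of the original analysis and uses made-up counts; the Others group is left out here on the assumption that its counts are too small for the test to be reliable.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts of respondents per gender and sleep-rating group.
counts = pd.DataFrame(
    {"1-2": [40, 55], "3": [20, 25], "4-5": [60, 80]},
    index=["Male", "Female"],
)

# Test whether the distribution of rating groups is independent of gender.
chi2, p_value, dof, expected = chi2_contingency(counts)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```

A large p-value would be consistent with the charts looking alike; it would not by itself show that social media causes the sleep issues.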

In [None]:
# Mean sleep-issue rating across all respondents
sleep_deprivation_rating = main_df["20. On a scale of 1 to 5, how often do you face issues regarding sleep?"].mean()
sleep_deprivation_rating


In [None]:
# Use groupby to categorize data by Occupation.
occupation_sleep_dep = main_df.groupby(["4. Occupation Status","20. On a scale of 1 to 5, how often do you face issues regarding sleep?"])
occupation_sleep_deprivation_count = occupation_sleep_dep.size().reset_index(name='Rating Count per Occupation')
# Create a pivot table from the grouped data.
occupation_sleep_deprivation = occupation_sleep_deprivation_count.pivot(index='4. Occupation Status', columns='20. On a scale of 1 to 5, how often do you face issues regarding sleep?',values= 'Rating Count per Occupation')
occupation_sleep_deprivation['Total'] = occupation_sleep_deprivation.sum(axis=1)
occupation_sleep_deprivation.loc['Total'] = occupation_sleep_deprivation.sum(axis=0)
occupation_sleep_deprivation

In [None]:
# Drop the unwanted row and column.
occupation_sleep_deprivation = occupation_sleep_deprivation.drop('Total', axis=0)
occupation_sleep_deprivation = occupation_sleep_deprivation.drop('Total', axis=1)
# Plot info
occupation_sleep_deprivation.plot(kind='bar',figsize=(13,8))
plt.xlabel('Occupation')
plt.ylabel('Response')
plt.title('People with sleep deprivation by Occupation')
plt.legend(title = 'Rating', loc='upper right', bbox_to_anchor=(1, 1))
plt.savefig("output_data/sleepDeprivationAndOccupation.png")
plt.show()

## People with sleep deprivation by Occupation
Here we can say that salaried workers and university students use social media more than retired people and school students. This is partly a consequence of the data set itself: the proportion of respondents between 20 and 29 is comparatively high, so we cannot draw a conclusion here. With a more evenly distributed data set, we could reach a sound conclusion at this point. We will definitely follow through on this in the future.
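
When group sizes are this uneven, comparing proportions within each occupation can be more informative than comparing raw counts. The sketch below shows the idea on a made-up pivot table with the same layout as `occupation_sleep_deprivation` (occupations as rows, ratings as columns); the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical pivot table: rows are occupations, columns are sleep ratings 1-5.
counts = pd.DataFrame(
    [[5, 10, 15, 10, 10], [1, 2, 3, 2, 2]],
    index=["University Student", "Retired"],
    columns=[1, 2, 3, 4, 5],
)

# Divide each row by its row total so every occupation sums to 1.
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions.round(2))
```

In this toy example the two occupations differ five-fold in size but have identical rating proportions, which raw counts would hide.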

In [None]:
#Get the platforms column (one delimited string per respondent)
appsList = main_df.iloc[:, 7]

#Split each response on ";" into a list of platform names
listOfApps = [app.split(";") for app in appsList]

#Add Number of Apps to DF
main_df['Number of Apps'] = [len(x) for x in listOfApps]
main_df.head()

In [None]:
# Count respondents by average time on social media and age group, then pivot the counts.
New_test = main_df.loc[ :, ["1. What is your age?","2. Gender","4. Occupation Status","20. On a scale of 1 to 5, how often do you face issues regarding sleep?","Age Groups","Number of Apps","Average Time on Social Media"]]
sleep_deprivation =New_test.groupby(["Average Time on Social Media","Age Groups"])
sleep_deprivation_count = sleep_deprivation.size().reset_index(name='Rating Count per Age Group')
sleep_dep =sleep_deprivation_count.pivot(index="Average Time on Social Media", columns="Age Groups",values= 'Rating Count per Age Group')
sleep_dep['Total'] = sleep_dep.sum(axis=1)
sleep_dep.loc['Total'] = sleep_dep.sum(axis=0)
sleep_dep

In [None]:
# Drop the not needed column and row from the pivot table.
sleep_dep = sleep_dep.drop('Total', axis=0)
sleep_dep = sleep_dep.drop('Total', axis=1)

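# Plot info: bars are grouped by average time on social media, one bar per age group.
sleep_dep.plot(kind='bar',figsize=(13,8))
plt.xlabel('Average Time on Social Media')
plt.ylabel('Number of Respondents')
plt.title('Respondents by Average Time on Social Media and Age Group')
plt.legend(title = 'Age Group', loc='upper right', bbox_to_anchor=(1, 1))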
plt.show()

In [None]:
# Count respondents for each number of apps (count() returns how many rows, not a time value).
app_and_hours = New_test.groupby(["Number of Apps"])
app_count = pd.DataFrame(app_and_hours["Average Time on Social Media"].count()).reset_index()
respondent_count = app_count["Average Time on Social Media"]
number_of_app = app_count['Number of Apps']
# Fit a regression line of respondent count against number of apps.
(slope, intercept, rvalue, pvalue, stderr) = linregress(number_of_app, respondent_count)
regress_values = number_of_app * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(number_of_app, respondent_count)
plt.plot(number_of_app,regress_values,"r-")
plt.annotate(line_eq, xy=(0.05, 0.9), xycoords='axes fraction', color="red")
plt.ylabel("Number of Respondents")
plt.xlabel("Number of Apps")
plt.show()

In [None]:
# Map each time category to a representative number of hours.
time_col = '8. What is the average time you spend on social media every day?'
time_spent_map = {
    'Between 0 and 1 hours': 1,
    'Less than an Hour': 0.5,
    'Between 1 and 2 hours': 1,
    'Between 2 and 3 hours': 2,
    'Between 3 and 4 hours': 3,
    'Between 4 and 5 hours': 4,
    'More than 5 hours': 5,
}
# Using map keeps the new column aligned with the rows; unmatched categories become NaN.
main_df['Time Spent'] = main_df[time_col].map(time_spent_map)
main_df.head(2)

In [None]:
half_hour_df = main_df.loc[main_df["Time Spent"]== 0.5,: ]
one_hour_df = main_df.loc[main_df["Time Spent"]== 1,: ]
two_hour_df = main_df.loc[main_df["Time Spent"]== 2,: ]
three_hour_df = main_df.loc[main_df["Time Spent"]== 3,: ]
four_hour_df = main_df.loc[main_df["Time Spent"]== 4,: ]
five_hour_df = main_df.loc[main_df["Time Spent"]== 5,: ]
five_hour_df.head(2)

In [None]:
sleep_col = '20. On a scale of 1 to 5, how often do you face issues regarding sleep?'
time_groups = [(half_hour_df, "Half hour"), (one_hour_df, "One hour"), (two_hour_df, "Two hours"),
               (three_hour_df, "Three hours"), (four_hour_df, "Four hours"), (five_hour_df, "Five hours")]

# One box per time group; sharey=True keeps all six boxes on the same rating scale.
fig, axes = plt.subplots(1, 6, sharey=True)
fig.set_size_inches(18, 10)
for ax, (group_df, label) in zip(axes, time_groups):
    ax.boxplot(group_df[sleep_col])
    ax.set_xticklabels([label])
axes[0].set_ylabel('Sleep Deprivation')
axes[2].set_xlabel('Time Spent on Social Media')
axes[2].set_title('Time Spent on Social Media vs Sleep Deprivation')
plt.savefig("output_data/timeSpentOnSocialMediaVsSleepDeprivation")
plt.show()

# Time Spent on Social Media vs Sleep Deprivation
1. The box plot "Time Spent on Social Media vs Sleep Deprivation" shows the ratings increasing with the time spent on social media. The ratings rise from half an hour to one hour, change little between one and two hours, and rise sharply from three to four hours. From this chart we can infer that, in this sample, the more time we spend on social media, the more sleep deprived we become. As a result, the depression total score increases as we spend more time on social media. Therefore, social media does have a negative impact on mental health.
2. There is still a lot to consider in this project, such as collecting a larger and more evenly distributed data set and accounting for other factors like weather and season. With more time to explore these topics, we could produce other interesting analyses as well.
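
The trend read off the box plots can also be summarized with a rank correlation, which suits ordinal data such as time buckets and 1 to 5 ratings. The sketch below is a swapped-in illustration rather than the notebook's own method, and the data is made up; on the real data one would pass `main_df['Time Spent']` and the sleep rating column instead.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical responses: hours bucket (as in the 'Time Spent' column) and sleep rating.
toy_df = pd.DataFrame({
    "Time Spent": [0.5, 1, 1, 2, 3, 3, 4, 5, 5, 5],
    "Sleep Rating": [1, 2, 1, 3, 3, 4, 4, 5, 4, 5],
})

# Spearman's rho measures how consistently the rating rises with time spent.
rho, p_value = spearmanr(toy_df["Time Spent"], toy_df["Sleep Rating"])
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.4f}")
```

A positive rho close to 1 would support the increasing trend, though a correlation alone does not establish that social media use causes the sleep issues.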