# Data Preprocessing

### Loading data

In [1]:
import pandas as pd
df = pd.read_csv("social_media.csv")

In [2]:
df.head(5)

Unnamed: 0,Timestamp,1. What is your age?,2. Gender,3. Relationship Status,4. Occupation Status,5. What type of organizations are you affiliated with?,6. Do you use social media?,7. What social media platforms do you commonly use?,8. What is the average time you spend on social media every day?,9. How often do you find yourself using Social media without a specific purpose?,...,11. Do you feel restless if you haven't used Social media in a while?,"12. On a scale of 1 to 5, how easily distracted are you?","13. On a scale of 1 to 5, how much are you bothered by worries?",14. Do you find it difficult to concentrate on things?,"15. On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media?","16. Following the previous question, how do you feel about these comparisons, generally speaking?",17. How often do you look to seek validation from features of social media?,18. How often do you feel depressed or down?,"19. On a scale of 1 to 5, how frequently does your interest in daily activities fluctuate?","20. On a scale of 1 to 5, how often do you face issues regarding sleep?"
0,4/18/2022 19:18:47,21.0,Male,In a relationship,University Student,University,Yes,"Facebook, Twitter, Instagram, YouTube, Discord...",Between 2 and 3 hours,5,...,2,5,2,5,2,3,2,5,4,5
1,4/18/2022 19:19:28,21.0,Female,Single,University Student,University,Yes,"Facebook, Twitter, Instagram, YouTube, Discord...",More than 5 hours,4,...,2,4,5,4,5,1,1,5,4,5
2,4/18/2022 19:25:59,21.0,Female,Single,University Student,University,Yes,"Facebook, Instagram, YouTube, Pinterest",Between 3 and 4 hours,3,...,1,2,5,4,3,3,1,4,2,5
3,4/18/2022 19:29:43,21.0,Female,Single,University Student,University,Yes,"Facebook, Instagram",More than 5 hours,4,...,1,3,5,3,5,1,2,4,3,2
4,4/18/2022 19:33:31,21.0,Female,Single,University Student,University,Yes,"Facebook, Instagram, YouTube",Between 2 and 3 hours,3,...,4,4,5,5,3,3,3,4,4,1


In [3]:
df.columns

Index(['Timestamp', '1. What is your age?', '2. Gender',
       '3. Relationship Status', '4. Occupation Status',
       '5. What type of organizations are you affiliated with?',
       '6. Do you use social media?',
       '7. What social media platforms do you commonly use?',
       '8. What is the average time you spend on social media every day?',
       '9. How often do you find yourself using Social media without a specific purpose?',
       '10. How often do you get distracted by Social media when you are busy doing something?',
       '11. Do you feel restless if you haven't used Social media in a while?',
       '12. On a scale of 1 to 5, how easily distracted are you?',
       '13. On a scale of 1 to 5, how much are you bothered by worries?',
       '14. Do you find it difficult to concentrate on things?',
       '15. On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media?',
       '16. Following the previous question, 

### Droping unnecessary columns

In [4]:
df = df.drop(columns=['5. What type of organizations are you affiliated with?','Timestamp',])

### Checking missing values

In [5]:
df.isnull().sum()

1. What is your age?                                                                                                    0
2. Gender                                                                                                               0
3. Relationship Status                                                                                                  0
4. Occupation Status                                                                                                    0
6. Do you use social media?                                                                                             0
7. What social media platforms do you commonly use?                                                                     0
8. What is the average time you spend on social media every day?                                                        0
9. How often do you find yourself using Social media without a specific purpose?                                        0
10. How often do you get

### Renaming columns

We grouped the questions by combining related questions. <br>
**ADHD** 
- Q10. How often do you get distracted by Social media when you are busy doing something?
- Q12. On a scale of 1 to 5, how easily distracted are you?
- Q14. Do you find it difficult to concentrate on things?

**Anxiety**
- Q9. How often do you find yourself using Social media without a specific purpose?
- Q11. Do you feel restless if you haven't used Social media in a while?
- Q13. On a scale of 1 to 5, how much are you bothered by worries?


**Depression**
- Q18. How often do you feel depressed or down?
- Q19. On a scale of 1 to 5, how frequently does your interest in daily activities fluctuate?
- Q20. On a scale of 1 to 5, how often do you face issues regarding sleep?

**Social Comparison**
- Q15. On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media?
- Q16. Following the previous question, how do you feel about these comparisons, generally speaking?
- Q17. How often do you look to seek validation from features of social media?

In [6]:
df.rename(columns ={
                       '1. What is your age?':'Age',
                       '2. Gender':'Gender',
                       '3. Relationship Status':'Relationship Status',
                       '4. Occupation Status':'Occupation',
                       '6. Do you use social media?':'SM User?',
                       '7. What social media platforms do you commonly use?':'Platforms',
                       '8. What is the average time you spend on social media every day?':'Time Spent',
                       '9. How often do you find yourself using Social media without a specific purpose?':'Anxiety1',
                       '10. How often do you get distracted by Social media when you are busy doing something?':'ADHD1',
                       "11. Do you feel restless if you haven't used Social media in a while?":'Anxiety2',
                       '12. On a scale of 1 to 5, how easily distracted are you?':'ADHD2',
                       '13. On a scale of 1 to 5, how much are you bothered by worries?':'Anxiety3',
                       '14. Do you find it difficult to concentrate on things?':'ADHD3',
                       '15. On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media?':'Social Comparison1',
                       '16. Following the previous question, how do you feel about these comparisons, generally speaking?':'Social Comparison2',
                       '17. How often do you look to seek validation from features of social media?':'Social Comparison3',
                       '18. How often do you feel depressed or down?':'Depression1',
                       '19. On a scale of 1 to 5, how frequently does your interest in daily activities fluctuate?':'Depression2',
                       '20. On a scale of 1 to 5, how often do you face issues regarding sleep?':'Depression3' },inplace=True)

#### Creating new column for Age

In [7]:
df['Age'] = df['Age'].astype(int)

In [8]:
def categorize_age(age):
    if age >= 51:
        return 'Above 50'
    elif 41 <= age <= 50:
        return '40-50'
    elif 31 <= age <= 40:
        return '30-40'
    elif 21 <= age <= 30:
        return '20-30'
    else:
        return 'Under 20'


df['Age group'] = df['Age'].apply(categorize_age)

In [9]:
df['Age group'] = pd.Categorical(df['Age group'], categories=["Under 20", "20-30", "30-40", "40-50", "Above 50"], ordered=True)
df = df.sort_values('Age group')

### Renaming values

In [10]:
df['Gender'] = df['Gender'].replace({'unsure ':'Unsure','Trans':'Other','There are others???':'Other','Nonbinary ': 'Non binary', 'Non-binary': 'Non binary','NB':'Non binary','Non binary ':'Non binary'})

In [11]:
df['Occupation'] = df['Occupation'].replace({'University Student':'Univeristy','School Student':'School','Salaried Worker': 'Worker'})

In [12]:
df['Relationship Status'] = df['Relationship Status'].replace({'In a relationship':'Relationship'})

In [13]:
df['Time Spent'] = df['Time Spent'].replace({
     'Less than an Hour': '<1 hour',
    'Between 1 and 2 hours': '1-2 hours',
    'Between 2 and 3 hours': '2-3 hours',
    'Between 3 and 4 hours': '3-4 hours',
     'Between 4 and 5 hours': '4-5 hours',
    'More than 5 hours': '5+ hours',
})

### Scaling and aggregating

Higher score equals worse mental health and for all questions except Question 16, the scale is as follows: <br>

1 --> very negative<br>
2 --> negative<br>
3 --> neutral<br>
4 --> positive<br>
5 --> very positive<br>

Only in question 16, a higher score equals  postive/better mental health, so we need to reverse the scale:<br>

5 --> very negative<br>
4 --> negative<br>
3 --> neutral<br>
2 --> positive<br>
1 --> very positive

In [14]:
df.loc[df['Social Comparison2'] == 1, 'Social Comparison2'] = 5
df.loc[df['Social Comparison2'] == 2, 'Social Comparison2'] = 4
df.loc[df['Social Comparison2'] == 3, 'Social Comparison2'] = 3
df.loc[df['Social Comparison2'] == 4, 'Social Comparison2'] = 2
df.loc[df['Social Comparison2'] == 5, 'Social Comparison2'] = 1

In [15]:
ADHD = ['ADHD1', 'ADHD2', 'ADHD3']
df['ADHD Score'] = df[ADHD].sum(axis=1)

Anxiety = ['Anxiety1', 'Anxiety2','Anxiety3']
df['Anxiety Score'] = df[Anxiety].sum(axis=1)

SocialComparison = ['Social Comparison1', 'Social Comparison2', 'Social Comparison3']
df['Social Comparison Score'] = df[SocialComparison].sum(axis=1)

Depression = ['Depression1', 'Depression2', 'Depression3']
df['Depression Score'] = df[Depression].sum(axis=1)

Total = ['ADHD Score', 'Anxiety Score', 'Social Comparison Score', 'Depression Score']
df['Total Score'] = df[Total].sum(axis=1)

df.drop(columns=ADHD + Anxiety + SocialComparison + Depression, inplace=True)

### Final data

In [16]:
df.head()

Unnamed: 0,Age,Gender,Relationship Status,Occupation,SM User?,Platforms,Time Spent,Age group,ADHD Score,Anxiety Score,Social Comparison Score,Depression Score,Total Score
78,18,Female,Single,Univeristy,Yes,"Facebook, YouTube, Reddit",3-4 hours,Under 20,11,12,8,15,46
191,19,Female,Single,School,Yes,"Facebook, Instagram, YouTube, Snapchat, Discor...",2-3 hours,Under 20,10,9,7,8,34
194,20,Male,Single,Univeristy,Yes,"Facebook, Instagram, YouTube, Snapchat, Discor...",3-4 hours,Under 20,13,13,8,14,48
387,19,Male,Single,School,Yes,"Facebook, Instagram, YouTube, Snapchat, Discor...",4-5 hours,Under 20,12,10,11,10,43
99,16,Male,Single,School,Yes,"Facebook, Instagram, YouTube, Snapchat, Discor...",4-5 hours,Under 20,14,9,8,6,37


### Counting each platform occurrence

In [17]:
df['Platforms'] = df['Platforms'].str.split(', ')

In [18]:
df_exploded = df.explode('Platforms')
platform_counts = df_exploded['Platforms'].value_counts()

print(platform_counts)

Platforms
YouTube      412
Facebook     407
Instagram    359
Discord      198
Snapchat     181
Pinterest    145
Twitter      131
Reddit       126
TikTok        94
Name: count, dtype: int64


### Downloading all dataframes as  csv

In [19]:
df.to_csv('SM.csv', index=False)
df_exploded.to_csv('SM1.csv', index=False)