### Case Study 1

Assume you are new to the data science field and want to find out what practitioners and aspiring data scientists are concerned about. One place where you may find such information is Twitter. However, Twitter users often use their real identities and may have reservations about sharing all their opinions publicly. Another place where such information may be found is the datascience subreddit on Reddit.com (https://www.reddit.com/r/datascience/). Users there are assumed to be anonymous, so they are more likely to share their opinions without reservations. To find common concerns among the datascience subreddit users, it might be a good idea to collect the top 100 posts in the subreddit from the year 2021, along with the top 3 comments on each of those posts. In this case study, we will do exactly that. Specific details can be found in the next few cells.

This data can be used for many different projects. However, we are only going to focus on the "data gathering" part. We will also do some cleaning.

**Note**: This case study contributes 12.5% to your overall grade.

## Step 1: 
###  15 points


**Description:** 

Learn about the **praw** package for Python and how you can use it to load Reddit posts, comments, etc. in a Jupyter Notebook. Do a Google search. You might find tutorials, and it is okay to use them. You may need secret keys for this part, which requires a Reddit account. You can use a throwaway account for this purpose. Write your code in the cell below. Any code you write to retrieve data from Reddit can go there.

**Grading criteria:** 

The code for this step must be correct. Otherwise, the next steps cannot be completed and will not be graded. If you receive a praw object from the data science subreddit, you will get the full 15 points. Other methods may be considered, but are not encouraged.
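Since this step involves secret keys, one option is to keep them out of the notebook entirely. The following is a minimal sketch, assuming the keys are stored in environment variables; the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical and not part of praw.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names are hypothetical; any names work as long as
    they match what you set in your shell.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Usage (needs praw and real credentials, so it is left commented out):
# import praw
# reddit = praw.Reddit(**load_reddit_credentials())
# subreddit = reddit.subreddit("datascience")
```

With this approach the notebook can be shared without exposing the keys.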

In [1]:
# your code for step 1 goes here

# requests is an HTTP library; it lets us send HTTP requests from Python
import requests
import requests.auth # Reddit requires HTTP Basic Auth

client_auth = requests.auth.HTTPBasicAuth('ZRB4PJGg69F5VA4fyZoTRw', 'e7zeyU4iRvXnLpb-x3hr3cJfLxPGdQ') # parameters = ('user', 'pass')

# I removed my username and password for security purposes. Anyone who wants to run this code should create their own Reddit API credentials
post_data = {"grant_type":"password", "username":"xxxx", "password":"xxxxx"} 

# unique and descriptive user agent:
headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"}

# acquiring a token / post() method to send some data to the server
response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)

# returns a JSON object 
response.json()


{'access_token': '612393978331-zMNMVTwxwhW6X62WgcApKaWyg98HoQ',
 'token_type': 'bearer',
 'expires_in': 3600,
 'scope': '*'}

In [2]:
# use the access token in the Authorization header to call the API
headers = {"Authorization": "bearer 612393978331-zMNMVTwxwhW6X62WgcApKaWyg98HoQ", "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"}
response = requests.get("https://oauth.reddit.com/api/v1/me", headers=headers)
print(response.status_code) 
response.json()

200


{'is_employee': False,
 'seen_layout_switch': False,
 'has_visited_new_profile': False,
 'pref_no_profanity': True,
 'has_external_account': False,
 'pref_geopopular': '',
 'seen_redesign_modal': False,
 'pref_show_trending': True,
 'subreddit': {'default_set': True,
  'user_is_contributor': False,
  'banner_img': '',
  'restrict_posting': True,
  'user_is_banned': False,
  'free_form_reports': True,
  'community_icon': None,
  'show_media': True,
  'icon_color': '#7EED56',
  'user_is_muted': False,
  'display_name': 'u_DS_501_A',
  'header_img': None,
  'title': '',
  'coins': 0,
  'previous_names': [],
  'over_18': False,
  'icon_size': [256, 256],
  'primary_color': '',
  'icon_img': 'https://www.redditstatic.com/avatars/defaults/v2/avatar_default_3.png',
  'description': '',
  'submit_link_label': '',
  'header_size': None,
  'restrict_commenting': False,
  'subscribers': 0,
  'submit_text_label': '',
  'is_default_icon': True,
  'link_flair_position': '',
  'display_name_prefixed'

In [3]:
# I installed the praw module using "pip install praw" in Anaconda Prompt
import praw

# create a read-only Reddit instance to retrieve public information from Reddit
reddit = praw.Reddit(
    client_id="ZRB4PJGg69F5VA4fyZoTRw",
    client_secret="e7zeyU4iRvXnLpb-x3hr3cJfLxPGdQ",
    user_agent=headers["User-Agent"],  # praw expects a string here, not the whole headers dict
)

print(reddit.read_only)

True


## Step 2: 
### 10 + 20 + 10 + 15 + 5 + 5 = 65 points

**Description**:
Once you have the mechanism in place to retrieve data from Reddit, your next step is to determine which parts of the data are necessary. For this case study, collect only the top posts from the year 2021. Also consider whether the score of each post was above 50. If the score was below 50, it might not have been an important post. Do not consider those posts. 

You may also observe that posts with memes or jokes sometimes get a lot of 'upvotes,' and because of that they may have high scores without being useful for this case study. To address this problem, you will simply get rid of any post that has fewer than 5 words in the title. 

You will also notice that praw returns time as an integer, which is inconvenient to read. Convert the integer time to a human-readable date. You do not need to include hours, minutes, or seconds. Just year, month, and day is enough.
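The conversion described above can be sketched with the standard library alone. This is an illustration, assuming the integer is a Unix timestamp in seconds (UTC); the function name `to_year_month_day` is hypothetical.

```python
from datetime import datetime, timezone


def to_year_month_day(created_utc):
    """Convert a Unix timestamp (seconds, UTC) to a 'YYYY-MM-DD' string."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


print(to_year_month_day(1609459200))  # 2021-01-01
```

Passing `tz=timezone.utc` keeps the result independent of the time zone of the machine running the notebook.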

**Grading Criteria:**
* posts are only from the year 2021: 10 points
* the integer time format converted into year-month-day: 20 points
* only posts with scores more than 50 were considered: 10 points
* only post titles with more than 5 words were kept: 15 points
* minimum 100 posts were collected: 5 points
* three comments collected for each post: 5 points

Note: All six grading criteria can be satisfied by writing one line or many lines of code. It does not matter. As long as your code satisfies the six criteria (in one line or many lines), you will get full points. Otherwise, you will get partial credit.
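The three filtering criteria (year, score, title length) can be combined into one predicate. The sketch below is one possible reading, not the required solution: it follows the grading criteria by keeping titles with more than 5 words and scores above 50, and the names `keep_post`, `MIN_SCORE`, and `MIN_TITLE_WORDS` are hypothetical.

```python
from datetime import datetime, timezone

MIN_SCORE = 50        # keep posts scoring strictly above this value
MIN_TITLE_WORDS = 5   # keep titles with strictly more words than this


def keep_post(created_utc, score, title, year=2021):
    """Return True if a post passes the year, score, and title-length filters."""
    posted = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return (
        posted.year == year
        and score > MIN_SCORE
        and len(title.split()) > MIN_TITLE_WORDS
    )


# Example with values taken from the sample output in this notebook
print(keep_post(1626100778, 2646, "how about that data integrity yo"))  # True
```

Keeping the thresholds as named constants makes it easy to adjust them if the criteria are interpreted differently.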

In [4]:
# your code for step 2 goes here

from datetime import datetime
import emoji

# obtain a Subreddit instance about Data Science
subreddit = reddit.subreddit('datascience')

# top() returns the highest-scoring posts (all-time by default); fetch more than 100 because many will be filtered out
top_data_science = subreddit.top(limit=500) 


# Function to convert the integer (Unix) timestamp into a UTC datetime
def time_converter(time_in_int):
    return datetime.utcfromtimestamp(time_in_int)

# Function to check if title has emojis
def has_emoji(text):
    for letter in text:
        if (letter in emoji.UNICODE_EMOJI["en"]):
            return True
    return False


list_submission_date = []
list_submission_score = []
list_submission_title = []
list_submission_com1 = []
list_submission_com2 = []
list_submission_com3 = []

# boundaries of the year 2021, defined once instead of on every iteration
start_2021 = datetime.fromisoformat('2021-01-01 00:00:00')
end_2021 = datetime.fromisoformat('2022-01-01 00:00:00')

for submission in top_data_science:
    time = time_converter(submission.created_utc) # call to time_converter() function
    score = submission.score # assign the post's score to a variable
    if start_2021 <= time < end_2021 and score > 50: # keep only 2021 posts with a score above 50
        title = submission.title
        words = len(title.split())
        emojis = has_emoji(title) # call to has_emoji() function
        if words > 5 and not emojis:
            list_submission_date.append(time)
            list_submission_score.append(score)
            list_submission_title.append(title)
            
            submission.comment_sort = 'top' # must be set before the comments are first accessed
            submission.comments.replace_more(limit=0) # method replace_more() replaces or removes MoreComments objects (replies)
            list_submission_com1.append(submission.comments[0].body)
            list_submission_com2.append(submission.comments[1].body)
            list_submission_com3.append(submission.comments[2].body)


In [5]:
#checking that the data has been loaded correctly
print(list_submission_date[0:10])
print(list_submission_score[0:10])
print(list_submission_title[0:10])

[datetime.datetime(2021, 7, 12, 14, 39, 38), datetime.datetime(2021, 2, 14, 3, 0, 3), datetime.datetime(2021, 4, 8, 19, 22, 16), datetime.datetime(2021, 7, 26, 13, 6, 23), datetime.datetime(2021, 3, 25, 13, 19, 39), datetime.datetime(2021, 8, 18, 6, 34, 5), datetime.datetime(2021, 4, 12, 4, 9, 25), datetime.datetime(2021, 8, 19, 16, 1, 5), datetime.datetime(2021, 6, 20, 13, 58, 29), datetime.datetime(2021, 6, 7, 14, 33, 48)]
[2646, 2184, 1729, 1630, 1396, 1371, 1217, 1202, 1131, 1019]
['how about that data integrity yo', 'I created a four-page Data Science Cheatsheet to assist with exam reviews, interview prep, and anything in-between', "I just got offered a data science internship with Amazon. I've been lurking on the sub for 3 years and just wanted to thank the folks who put together stats/ml cheat sheets.", 'Me showing off a suspiciously well-performing model [OC]', 'Alan Turing is the new face on the British £50 note', 'Very proud of my CS book collection.', 'I found a research pap

## Step 3: 
### 10 points

Save the data on your local disk. You may have used lists or similar data structures for the initial processing. Convert that data structure into a Pandas dataframe. Save the dataframe as a .csv file on your local disk. 

Here are the column details:

Column 1: Date

Column 2: Post score

Column 3: Post title

Column 4: Top comment 1

Column 5: Top comment 2

Column 6: Top comment 3

When you create the .csv file, it should have 101 rows (including column names) and 6 columns.
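One way to confirm the expected shape is to count rows and columns in the saved file. The sketch below uses only the standard library; the helper name `csv_shape` is hypothetical.

```python
import csv


def csv_shape(path):
    """Return (rows, columns) of a CSV file, counting the header as a row."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    n_columns = len(rows[0]) if rows else 0
    return len(rows), n_columns


# Usage, with whatever file name you saved the data under:
# print(csv_shape("my_posts.csv"))  # the assignment expects (101, 6)
```

Using `csv.reader` rather than counting lines matters here because comments can contain line breaks inside a single quoted field.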

**Grading criteria:**
If your code produces a .csv file on the local disk in the same folder as the Jupyter Notebook file, you get full points. Otherwise, no points.

In [6]:
# your code for step 3 goes here
import pandas as pd

df_submission = pd.DataFrame(list(zip(list_submission_date, list_submission_score, list_submission_title,
                                     list_submission_com1, list_submission_com2, list_submission_com3)), 
                             columns=['Date', 'Score', 'Title', 'Top Com 1', 'Top Com 2', 'Top Com 3' ])
df_submission.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       135 non-null    datetime64[ns]
 1   Score      135 non-null    int64         
 2   Title      135 non-null    object        
 3   Top Com 1  135 non-null    object        
 4   Top Com 2  135 non-null    object        
 5   Top Com 3  135 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 6.5+ KB


In [7]:
df_submission.head(30)

Unnamed: 0,Date,Score,Title,Top Com 1,Top Com 2,Top Com 3
0,2021-07-12 14:39:38,2646,how about that data integrity yo,"If you find a good data engineer, you do every...",What are some examples of differences between ...,The true heroes
1,2021-02-14 03:00:03,2184,I created a four-page Data Science Cheatsheet ...,Nice work! Maybe consider adding another page ...,Doing the Lord’s work out here. Thank you so m...,"Oh man, I have a test coming up in data analyt..."
2,2021-04-08 19:22:16,1729,I just got offered a data science internship w...,Congrats on the offer. Amazon is a great first...,Mind sharing some of the cheat sheets?,congratulations! i recently switched to data s...
3,2021-07-26 13:06:23,1630,Me showing off a suspiciously well-performing ...,At a corporate presentation a consultant showc...,Pls respond,Thanks for checking out the comic! This idea c...
4,2021-03-25 13:19:39,1396,Alan Turing is the new face on the British £50...,The terrible things this genius went through :'(,Perhaps the man who both contributed more to c...,What are some other computer scientists like A...
5,2021-08-18 06:34:05,1371,Very proud of my CS book collection.,How many of those have you read? Which would y...,"Top tip, if you sleep on it the information wi...",I see the hungry caterpillar sneaking in there...
6,2021-04-12 04:09:25,1217,I found a research paper that is almost entire...,Send an email to the editor of the journal. In...,Holy shit they aren't even trying to hide it. ...,"Yeah, this is bizarre. It's not a published jo..."
7,2021-08-19 16:01:05,1202,"The Key Word in Data Science is Science, not Data",I can't tell you how many times I've backed ou...,[deleted],"It is also important to understand what that ""..."
8,2021-06-20 13:58:29,1131,Hi! I just expanded the Data Science Cheatshee...,Wow! Thank you!,Jumping in to say that your sheet just might h...,This is an excellent resource for reviewing ML...
9,2021-06-07 14:33:48,1019,Data Science and Data Analytics is becoming ul...,Lots of companies also think that Data Scienti...,I’m actually considering switching to data eng...,Data science is different now (https://veekayb...


In [8]:

df_submission.to_csv('DS_reddit_top_post.csv', index=False)

In [24]:
#looking for insightful posts
print('Date: ', df_submission.at[83, 'Date'])
print('Score: ', df_submission.at[83, 'Score'])
print('Title: ', df_submission.at[83, 'Title'])
print('Top Com1: ', df_submission.at[83, 'Top Com 1'])
print('Top Com2: ', df_submission.at[83, 'Top Com 2'])
print('Top Com3: ', df_submission.at[83, 'Top Com 3'])

Date:  2021-01-25 22:27:16
Score:  380
Title:  Did anyone regret choosing DS as a career or has got disillusioned with it?
Top Com1:  I am in no way as experienced as you. However, I have heard from a VP of Data who told me that he formed his department to have 3 paths. One path is to become a manager of junior data analysts if they enjoy managing people. One path is to become an ML expert if they enjoy the technical aspect. One path is to become a business analyst if they enjoy the product/business aspect. 

From your post, it sounds like you went down the ML expert route but you're unhappy with it. Data Science is so broad, that you won't have to exit DS completely to find enjoyment in your work. I think if you wanted to, you could try to transfer your skills into the other paths. Maybe in those paths will you find out whether you're meat or fish!
Top Com2:  I don't regret it, since it's a safe job and pays better than most careers.

However, I can empathize with the lonely/unfulfill

## Step 4:
### 10 points
#### Presentation slides:
   
Create presentation slides for this case study. The slides should provide an overview of the problem you tried to solve, the methods you used (don't put actual code in the slides), and any new insights you discovered from the data you collected. You may include actual post titles or comments that you found insightful. The number of slides should be around 6-7 (no hard limit). Three of you will be randomly chosen and asked to present your work in class. You should be prepared to present your work for 5 minutes.

**Notes on grading**: 5 points will be deducted if you are not prepared to present on the day of submission.

### What to submit:

Put the Jupyter Notebook file and the .csv file in a folder. Then convert your presentation slides into a PDF file and put it in the same folder. Zip the folder. After zipping, it should have the extension .zip. The name of the .zip file should be firstname_lastname_casestudy_1.zip . Upload the .zip file on Canvas.
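If you prefer to create the archive from Python, the zipping step could look like the sketch below. It assumes the notebook, .csv, and PDF are already in one folder; the helper name `zip_submission` and the example names are hypothetical.

```python
import shutil
from pathlib import Path


def zip_submission(folder, first_name, last_name):
    """Zip `folder` as firstname_lastname_casestudy_1.zip next to it."""
    folder = Path(folder).resolve()
    base_name = folder.parent / f"{first_name}_{last_name}_casestudy_1"
    # make_archive appends ".zip" itself and returns the archive's path
    return shutil.make_archive(
        str(base_name), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )


# Usage (hypothetical folder and names):
# zip_submission("case_study_1", "firstname", "lastname")
```

Passing `base_dir=folder.name` keeps the folder itself inside the archive, so unzipping recreates it.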

