<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 -  Web APIs & NLP

# Contents (Part 1) - This notebook 

- [Executive Summary](#Executive-Summary)
- [Problem Statement](#Problem-Statement)
- [Background and Research](#Background-and-Research)
- [Data Fetching](#Data-Fetching)
- [Data Filtering (Apple)](#Data-Filtering-(Apple))
- [Data Filtering (Samsung)](#Data-Filtering-(Samsung))

# Contents (Part 2)

- Data Exploration
- Additional Data Filtering
- Data Dictionary
- Natural Language Processing (Apple)
- Natural Language Processing (Samsung)

# Contents (Part 3)

- Preprocessing
- Dummy Classifier
- Multinomial Naive-Bayes
- Logistic Regression
- Support Vector Classifier
- Random Forest
- Results and Analysis
- Conclusion and Recommendations

# Executive Summary

Anyone tracking the latest trends in technology needs tools and metrics to determine a brand's Share of Voice in the market. This includes writers, YouTubers, tech enthusiasts and investors. With Apple and Samsung being the notable heavyweights in the smartphone domain, the battle has become increasingly heated with more eyes on the two. To this end, the Editor-in-chief of a tech magazine has asked for a classifier tool to scan online tech forums to see if more people are talking about Apple or Samsung.

A project was undertaken to build this tool using machine learning. The Pushshift API was used to pull posts from the subreddits r/Apple and r/Samsung. The API is a data-fetching tool created by the moderators of r/datasets. After some data processing, 4000 of the most recent posts from each subreddit were used for modelling, totalling 8000. The text of the title and post body were combined and taken as the text for each post. Before modelling, Natural Language Processing indicated that many people tended to ask product-troubleshooting questions on the subreddits.

For modelling, the Multinomial Naive-Bayes model, Logistic Regression model, Support Vector Classifier model and Random Forest model were tested and optimised. The chosen model was Logistic Regression. The classifiers were primarily evaluated by their accuracy score, and a successful model should have an accuracy score of at least 0.8. The accuracy score represents how many posts the model predicted correctly, divided by the total number of predictions made. 

Logistic Regression produced an accuracy score of 0.956 on the testing subset of the data, with 0.973 on the training subset. This was a tie with the Support Vector Classifier, but its score of 0.996 on the training subset showed that the Support Vector Classifier overfitted more to the training data. This made Logistic Regression a more favourable choice. Logistic Regression was also more interpretable than the Support Vector Classifier, being able to show which words were the strongest predictors for classification into a particular subreddit. In this case, it was found that the brand and product names were the strongest predictors. 
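The overfitting comparison above comes down to the gap between training and testing accuracy. A minimal sketch using only the scores reported in this summary (the dictionary and variable names are illustrative and not taken from the modelling notebook):

```python
# Accuracy scores as reported in the summary above: (train, test).
scores = {
    'Logistic Regression': (0.973, 0.956),
    'Support Vector Classifier': (0.996, 0.956),
}

# A larger train-test gap suggests more overfitting to the training data.
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f'{name}: train-test gap = {gap}')
```

With equal test scores, the smaller gap is what favours Logistic Regression here.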

To sum up, the project endpoint was achieved, and the tool is ready to be deployed and further tested by the tech magazine.

# Problem Statement 

You are a data science professional working for a tech magazine discussing the latest trends in technology. The Editor-in-chief wants to come up with a big year-end story that reveals the most talked-about tech brand of the year. She intends to scan online tech forums to see if more people are talking about Apple or Samsung. However, she cannot possibly scroll through the forums manually.

As the first step to tackling this, you have been directed to build a text classifier that can detect if any given online post is talking about Apple or Samsung. This classifier will be primarily evaluated by its accuracy score, and a successful model should have an accuracy score of at least 0.8.
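As a rough illustration of the success criterion, accuracy is the number of correct predictions divided by the total number of predictions. The labels below are made up for the example, and `accuracy` and `TARGET` are hypothetical names rather than anything defined in this project:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: the subreddit each post came from vs. the predicted subreddit.
y_true = ['apple', 'samsung', 'apple', 'samsung', 'apple']
y_pred = ['apple', 'samsung', 'samsung', 'samsung', 'apple']

TARGET = 0.8  # success threshold from the problem statement
score = accuracy(y_true, y_pred)
print(score, score >= TARGET)
```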

# Background and Research

Share of Voice is a measure of how much of the market your brand accounts for, and can be measured by metrics like keywords ([*source*](https://sproutsocial.com/glossary/share-of-voice)). The tech market has been dominated by a few major players, and Apple and Samsung are the notable heavyweights in the smartphone domain.

Although the iPhone has the dominant share of voice in general, one analysis that split share of voice into the topics of cost, design and features found that Samsung led on cost and design, while the iPhone led on features ([*source*](https://hottopics.ht/9526/iphone-losing-smartphones-share-of-voice-war/)). The same analysis suggested that the ubiquity of Apple may have led people to talk about other brands instead.

It is thus important for any analyst of tech trends to keep abreast of the current changes in the market. Apple and Samsung are direct competitors in the mobile phone space, releasing competing flagship models within the same time frame ([*source*](https://www.which.co.uk/reviews/mobile-phones/article/apple-iphone-vs-samsung-galaxy-mobile-phones-aZL5V5m4UGbw)). Apple's market share based on sales figures was 20.8% in Q4 2020, while Samsung's market share was 16.2% ([*source*](https://www.zdnet.com/article/apple-vs-samsung-who-makes-a-better-smartphone/)). There is thus a close fight between the two, and tech users will be eager to know the latest state of affairs between the two companies.

Apple had $111 billion in sales for the Christmas of 2020 ([*source*](https://www.bbc.com/news/business-55835504)). It also has an App Store with its own mobile app ecosystem. Meanwhile, Samsung is trying to keep up by improving the features on its phones, such as a fingerprint sensor embedded in the screen ([*source*](https://www.businessinsider.com/samsung-galaxy-s20-ultra-apple-iphone-11-features-specs-camera-2020-2#higher-resolution-camera-sensors-3)) and fast charging ([*source*](https://www.samsung.com/us/support/answer/ANS00062589/)).

The Pushshift API can be used to obtain the data from the subreddits. It is a tool made by the moderator team of r/datasets ([*source*](https://github.com/pushshift/api)). The text that forms the body of the posts will play a large role in the analysis. Sentences are about 15-20 words long, which gives a guideline for judging whether a segment of text contains at least one sentence ([*source*](https://techcomm.nz/Story?Action=View&Story_id=106)).
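One way to apply the sentence-length guideline is a simple word count. This is only a sketch: the 15-word threshold is taken from the lower end of the guideline above, and `has_sentence_length` is a hypothetical helper, not a function used elsewhere in this project.

```python
MIN_SENTENCE_WORDS = 15  # lower end of the 15-20 word guideline

def has_sentence_length(text, min_words=MIN_SENTENCE_WORDS):
    """Return True if the text has at least `min_words` whitespace-separated words."""
    return len(text.split()) >= min_words

# Made-up examples of a short and a longer piece of text.
short_post = 'Correct way to use battery pack?'
long_post = ('I bought a pair of earbuds a little under a year ago '
             'and thought they worked very well until last week')

print(has_sentence_length(short_post))
print(has_sentence_length(long_post))
```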

# Data Fetching

## Importing necessary libraries

The libraries below will be used in this notebook.

In [1]:
import requests
import time
import pandas as pd

# Enables Pandas to display all the columns
pd.set_option('display.max_columns', None)

# Enables Pandas to display all the rows
pd.set_option('display.max_rows', None)

# Enables Pandas to display more text in a column
pd.set_option('display.max_colwidth', 100)

## The Apple subreddit 

When fetching the posts, there were a large number of posts with '[removed]' under the `selftext`. The `selftext` is the body text of the post, so a post without a `selftext` consists only of a title (and possibly comments). For this project, only the posts themselves were fetched, not the comments. The Pushshift API allows us to fetch only posts whose `selftext` is not a given string, so the '[removed]' posts were excluded in the API call. There were also a large number of posts with NaN `selftext`, which would be filtered out later. To leave enough posts after filtering, 40,000 posts were fetched. The code below was used for the process:

In [2]:
# URL for the API call.
url = 'https://api.pushshift.io/reddit/search/submission'

# Parameters for the API call. 
# The maximum number of posts per pull is 100.
# 'selftext:not' means pulling only posts where 'selftext' is not the specified string.
params = {
    'subreddit': 'apple',
    'size': 100,
    'selftext:not': '[removed]'
}

# Initialise an empty list to store the data.
frames = []

# Starting counter for the number of pulling loops.
frame_count = 0

# Set up a while loop that will continue till the desired number is reached. 
# Setting it to run while 'frame_count < 400' means 400 loops will be done.
# The counter stops at 399, but the first loop when 'frame_count = 0' is counted.
# 400 x 100 posts will be fetched, which is 40,000.
while frame_count < 400:
    
    # Uses the requests library to get the data.
    res = requests.get(url, params)
    
    # Formats the data in .json.
    data = res.json()
    
    # Converts the data to a Pandas DataFrame.
    frame = pd.DataFrame(data['data'])
    
    # Appends the DataFrame to 'frames', that stores the data.
    frames.append(frame)
    
    # Increases the counter by 1.
    frame_count += 1
    
    # Looks at the 'created_utc' of the last row (the oldest post) of the currently fetched data. 
    # Sets the 'before' parameter so that the next loop fetches only posts before that post.
    try:
        params['before'] = frame.tail(1).iloc[0]['created_utc']
        
    # An empty pull has no last row, which raises an IndexError:
    except IndexError:        
        print('IndexError occurred')
        
    # Sets a sleep timer of one second between requests to prevent overloading.
    time.sleep(1)

# Concatenates the collected data into one DataFrame.
df_apple = pd.concat(frames, ignore_index=True)

In [23]:
df_apple.shape

(39850, 95)

In [4]:
df_apple.to_csv('../data/df_apple.csv', index=False)

## The Samsung subreddit

When fetching the posts, there were a large number of posts with '[removed]' under the `selftext`. Hence, as with Apple, they were also removed by specifying the `selftext:not` parameter in the API call. Although there were some posts with NaN `selftext` which would be filtered out, the number is not as high as with Apple. Hence 10,000 posts were fetched for the filtering. The code below was used for the process:

In [5]:
# URL for the API call.
url = 'https://api.pushshift.io/reddit/search/submission'

# Parameters for the API call. 
# The maximum number of posts per pull is 100.
# 'selftext:not' means pulling only posts where 'selftext' is not the specified string.
params = {
    'subreddit': 'samsung',
    'size': 100,
    'selftext:not': '[removed]'
}

# Initialise an empty list to store the data.
frames = []

# Starting counter for the number of pulling loops.
frame_count = 0

# Set up a while loop that will continue till the desired number is reached. 
# Setting it to run while 'frame_count < 100' means 100 loops will be done.
# The counter stops at 99, but the first loop when 'frame_count = 0' is counted.
# 100 x 100 posts will be fetched, which is 10,000.
while frame_count < 100:
    
    # Uses the requests library to get the data.
    res = requests.get(url, params)
    
    # Formats the data in .json.
    data = res.json()
    
    # Converts the data to a Pandas DataFrame.
    frame = pd.DataFrame(data['data'])
    
    # Appends the DataFrame to 'frames', that stores the data.
    frames.append(frame)
    
    # Increases the counter by 1.
    frame_count += 1
    
    # Looks at the 'created_utc' of the last row (the oldest post) of the currently fetched data. 
    # Sets the 'before' parameter so that the next loop fetches only posts before that post.
    try:
        params['before'] = frame.tail(1).iloc[0]['created_utc']
        
    # An empty pull has no last row, which raises an IndexError:
    except IndexError:        
        print('IndexError occurred')
        
    # Sets a sleep timer of one second between requests to prevent overloading.
    time.sleep(1)

# Concatenates the collected data into one DataFrame.
df_samsung = pd.concat(frames, ignore_index=True)

In [11]:
df_samsung.shape

(9995, 89)

In [7]:
df_samsung.to_csv('../data/df_samsung.csv', index=False)
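
The two fetching loops above differ only in the subreddit name and the number of pulls, so the paging logic can be sketched as one function. This is an illustration rather than the code that produced the data: `fetch_posts` and `get_page` are hypothetical names, and `get_page` stands in for the `requests.get(...).json()['data']` call so that the sketch runs without network access. Unlike the loops above, it stops as soon as a pull comes back empty.

```python
import time

def fetch_posts(get_page, subreddit, n_pulls, size=100, pause=1.0):
    """Page backwards through a subreddit using the 'before' parameter.

    get_page(params) must return a list of post dicts, newest first,
    each with a 'created_utc' key.
    """
    params = {'subreddit': subreddit, 'size': size, 'selftext:not': '[removed]'}
    posts = []
    for _ in range(n_pulls):
        page = get_page(dict(params))
        if not page:
            # Nothing older is left, so further pulls would return nothing new.
            break
        posts.extend(page)
        # The last post in a page is the oldest; fetch only older posts next.
        params['before'] = page[-1]['created_utc']
        time.sleep(pause)
    return posts

# Stand-in for the API: 250 fake posts with descending timestamps.
FAKE_POSTS = [{'created_utc': ts} for ts in range(1000, 750, -1)]

def fake_get_page(params):
    before = params.get('before', float('inf'))
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:params['size']]

posts = fetch_posts(fake_get_page, 'apple', n_pulls=5, pause=0)
print(len(posts))
```

Passing the page-fetching callable in as an argument keeps the paging logic testable without calling the real API.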

# Data Filtering (Apple)

## Loading the data 

The dataset from the Apple subreddit will be loaded. We will only select certain columns.

In [8]:
ap = pd.read_csv('../data/df_apple.csv', usecols=['subreddit',
                                                  'author',
                                                  'created_utc',
                                                  'selftext',
                                                  'title',
                                                  'author_flair_css_class'])

In [9]:
ap.head(5)

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
0,gordonmcdowell,,1635298827,,apple,"Apple's M1 Pro, M1 Max SoCs “…you’d have to bring out server-class hardware to get ahead of the ..."
1,Arun_Sampath,,1635297851,,apple,M1 Max Macbook Pro vs RTX 3070 - Blender Render Times - Viewpoint FPS
2,eltarekc,,1635285305,,apple,شركة الطارق كلين تنظيف فلل بالمزاحمية
3,Gingerstrands,,1635283070,"I went to the Apple Store today to look at the pros in person. To be honest, I’m even more uncer...",apple,"To anyone like me who was wondering: yes, you can fit two apps side by side on the 14” comfortab..."
4,BiovelaBoomer,,1635282964,"What is the correct way to use the battery pack, would it be plugging in the cable into the ipho...",apple,Correct way to use battery pack?


## Filtering the data 

### Removing rows with null values in the `selftext` column

In [10]:
ap['selftext'].isna().value_counts()

True     30963
False     8887
Name: selftext, dtype: int64

Of the 39,850 rows, 30,963 have a null value in the `selftext` column. We will drop these rows:

In [11]:
ap.dropna(subset=['selftext'], inplace=True)

There are no more null values in the `selftext` column:

In [12]:
ap['selftext'].isna().value_counts()

False    8887
Name: selftext, dtype: int64

### Removing rows with null values in the `title` column

There are no rows with null values in the `title` column.

In [13]:
ap['title'].isna().value_counts()

False    8887
Name: title, dtype: int64

### Dropping duplicates (according to `title` and `selftext`)

In [14]:
ap.shape

(8887, 6)

There are 8887 rows in the DataFrame. Dropping duplicates:

In [15]:
ap.drop_duplicates(subset=['title', 'selftext'], inplace=True)

The row count has fallen from 8887 to 8460, as 427 duplicate rows were dropped:

In [16]:
ap.shape

(8460, 6)

### Removing posts with [removed] as the `selftext`

There are no posts with '[removed]' as the `selftext`, as we have already filtered them out at the data-fetching stage.

In [17]:
(ap['selftext'] == '[removed]').value_counts()

False    8460
Name: selftext, dtype: int64

### Removing posts with [deleted] as the `selftext`

There are a number of posts with '[deleted]' as the `selftext`:

In [18]:
(ap['selftext'] == '[deleted]').value_counts()

False    7791
True      669
Name: selftext, dtype: int64

Removing the rows:

In [19]:
ap = ap[ap['selftext'] != '[deleted]']

There are no more rows with '[deleted]' as the `selftext`.

In [20]:
(ap['selftext'] == '[deleted]').value_counts()

False    7791
Name: selftext, dtype: int64

### Removing completely empty posts

There are no posts that have an empty string as the `selftext`:

In [21]:
(ap['selftext'] == '').value_counts()

False    7791
Name: selftext, dtype: int64

### Removing AutoModerator posts

There are a number of posts by 'AutoModerator'.

In [22]:
(ap['author'] == 'AutoModerator').value_counts()

False    7334
True      457
Name: author, dtype: int64

The posts are as follows:

In [23]:
ap[ap['author'] == 'AutoModerator'].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
28,AutoModerator,,1635242418,Welcome to the Daily Advice Thread for /r/Apple. This thread can be used to ask for technical ad...,apple,Daily Advice Thread
1053,AutoModerator,,1633406479,Today marks 10 years since the passing of Steve Jobs and we wanted to create a space here for th...,apple,Remembering Steve Jobs
2536,AutoModerator,,1630674022,"Hi r/Apple, welcome to today's megathread to discuss Apple's new CSAM on-device scanning.\n\nAs ...",apple,Daily Megathread - On-Device CSAM Scanning
3461,AutoModerator,,1628859620,"Hi r/Apple, welcome to today's megathread to discuss Apple's new CSAM on-device scanning.\n\nAs ...",apple,Daily Megathread - On-Device CSAM Scanning
3537,AutoModerator,,1628686821,"Hi r/Apple, welcome to today's megathread to discuss Apple's new CSAM on-device scanning.\n\nAs ...",apple,CSAM Daily Megathread


They appear to be moderator posts opening discussion threads and will be filtered out:

In [24]:
ap = ap[ap['author'] != 'AutoModerator']

There are no more posts by 'AutoModerator'.

In [25]:
(ap['author'] == 'AutoModerator').value_counts()

False    7334
Name: author, dtype: int64

### Removing exjr_'s posts

There are a number of posts by 'exjr_'.

In [26]:
(ap['author'] == 'exjr_').value_counts()

False    7257
True       77
Name: author, dtype: int64

The posts are as follows:

In [27]:
ap[ap['author'] == 'exjr_'].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
76,exjr_,,1635181485,"#It's update day! \n\nI'll update the post when OTA is out. In the meantime, IPSW files are avai...",apple,"Apple releases iOS/iPadOS 15.1, watchOS 8.1, audioOS 15.1, tvOS 15.1!"
78,exjr_,,1635181358,Here's Monterey's landing page: https://www.apple.com/macos/monterey/\n\nFull list of features: ...,apple,Apple releases macOS Monterey!
97,exjr_,,1635167750,Sorry for being late here. Updating as fast as I can.,apple,"Unboxing, First Impressions &amp; Review Megathread | MacBook Pro Late 2021 (14-inch, and 16-inch)"
431,exjr_,,1634605416,Hey /r/Apple!\n\nIn an attempt to curb the number of threads about pre-order choices/configurati...,apple,"Pre-Order and Shipping Megathread | MacBook Pro Late 2021 (14-inch, and 16-inch) &amp; AirPods 3..."
713,exjr_,,1634130061,Embargo has been lifted making this thread to links to the major media outlets unboxing/reviewin...,apple,"Unboxing, First Impressions &amp; Review Megathread | Apple Watch Series 7"


They appear to be moderator posts opening discussion threads and will be filtered out:

In [28]:
ap = ap[ap['author'] != 'exjr_']

There are no more posts by 'exjr_'.

In [29]:
(ap['author'] == 'exjr_').value_counts()

False    7257
Name: author, dtype: int64

### Removing moderator posts 

Using the `author_flair_css_class` column, we can see that there are more moderator posts.

In [30]:
ap['author_flair_css_class'].value_counts(dropna=False)

NaN          7208
moderator      49
Name: author_flair_css_class, dtype: int64

The posts are as follows:

In [31]:
ap[['title', 'selftext', 'author']][ap['author_flair_css_class'] == 'moderator'].head(5)

Unnamed: 0,title,selftext,author
484,"Apple's ""Unleashed"" | Post-Event Megathread","Hello r/Apple and welcome to the post-event megathread for Apple's ""Unleashed"" event\n\nLet us k...",aaronp613
495,"Apple's ""Unleashed"" | Event Megathread","# GOOD MORNING! GOOD MORNING! GOOOOOOD MORNING!\n\n## What to expect:\n\n* 14"" MacBook Pro (M1X ...",aaronp613
509,"Apple's ""Unleashed"" | Pre-Event Megathread","## GOOD MORNING, r/Apple!\n\n## Welcome to Apple's ""Unleashed"" Pre-Event Megathread!\n\n[Only a ...",aaronp613
531,"Will Apple think ""you are going to love it?""",\n\n[View Poll](https://www.reddit.com/poll/qa8ppr),aaronp613
543,"PSA: How submissions will work tomorrow during the ""Unleashed"" Apple event","Hey [r/Apple](https://old.reddit.com/r/Apple),\n\nWe are 1 day away from Apple's ""Unleashed"" eve...",aaronp613


They appear to be moderator posts opening discussion threads and will be filtered out:

In [32]:
ap = ap[ap['author_flair_css_class'] != 'moderator']

There are no more moderator posts.

In [33]:
ap['author_flair_css_class'].value_counts(dropna=False)

NaN    7208
Name: author_flair_css_class, dtype: int64
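
A note on why the filter above keeps the posts with no flair: in pandas, comparing a missing value with `!=` evaluates to True, so the NaN rows pass the mask. A small sketch on made-up data (the `demo` DataFrame and its authors are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up flair values: two posts without a flair and one moderator post.
demo = pd.DataFrame({
    'author': ['user_a', 'mod_b', 'user_c'],
    'author_flair_css_class': [np.nan, 'moderator', np.nan],
})

# NaN != 'moderator' is True, so rows with a missing flair are kept.
mask = demo['author_flair_css_class'] != 'moderator'
kept = demo[mask]
print(kept['author'].tolist())
```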

### Removing links

There are links in the data:

In [34]:
ap[ap['selftext'].str.contains('http://', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
8231,chestdrop,,1618748178,"After 1 year of hard work, we’ve launched our first beta of Moonshot, social network RPG, where ...",apple,"BETA: Moonshot is not your average social network: it has only original, sincere content and it'..."
8802,DrSteveBrule_,,1616979959,"[Demo Video](https://youtu.be/qjbHOk4Bl0A)\n\nFor the last 3 years, we have been developing an a...",apple,[Self-promotion] I made an AI powered pill identification app after almost dying from an incorre...
12232,chufucious,,1607892663,Hi all!\n\nCaptioning videos can be time consuming.\n\nI created Tap Tap Cap to make the quickes...,apple,Tap Tap Cap - The easiest way to time text on videos!
15442,rkstk,,1602434397,"Hi all, developer here. Based on your feedback from 3 months ago I added quite a few things to t...",apple,Life Notes: personal notes and knowledge base app (beta)
15456,rkstk,,1602423531,"Hi all, developer here. Based on your feedback from 3 months ago I added quite a few things to t...",apple,3 months later: update for Life Notes writing app now with support for reminders


In [35]:
ap[ap['selftext'].str.contains('www.', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
139,_dave_maxwell_,,1635077210,[TextPhoto](https://apps.apple.com/app/id1581624527) is the mobile app that you can use to conve...,apple,I made the app to convert pictures into word artworks (typography effect)
144,stefanvd,,1635069223,"Hello Apple fans,\n\nHope you have a great autumn Sunday.\n\nI love to tell you about my new Saf...",apple,"New Safari extension that customizable your Safari new tab page (with video background, image, a..."
153,happybuy,,1635049003,"Hey redditors, after a long time in development (and lots of requests!!) we’ve released an updat...",apple,Block YouTube ads in Safari with Magic Lasso Adblock v3.1
534,stefanvd,,1634502602,"Hello Apple fans,\n\nHope you have a great autumn Sunday.\n\nI love to tell you about my new Saf...",apple,New Safari extension that customizable your Safari new tab page for iOS 15 and macOS 12 Monterey
542,TimBurbanks1970,,1634490563,Some mobile games developers are already on the move after the Epic / Apple lawsuit… several gam...,apple,"Apple &lt;&gt; Epic lawsuit - Developers already on the move, how will Apple react?"


Links in the title and selftext will be removed using the following code:

In [36]:
# Using a regular expression to match a string with the following format: 
# www.example(rest of link) (or starting with http:// or https://). Case insensitive. 
# Replaces the string with a space. Does not target links not starting with http or www.  
ap['selftext'] = ap['selftext'].str.replace(r'(http|www[.])[\S]*', 
                                            ' ', 
                                            regex=True, 
                                            case=False)
ap['title'] = ap['title'].str.replace(r'(http|www[.])[\S]*', 
                                      ' ', 
                                      regex=True, 
                                      case=False)

# Using a regular expression to match any whitespace-delimited token containing
# .com, .net or .org (e.g. example.com/page). Case insensitive. 
# Replaces the token with a space. Only targets these 3 common domain names. 
# Matching per token avoids also removing the words that precede a domain on the same line.
ap['selftext'] = ap['selftext'].str.replace(r'\S*\.(?:com|net|org)\b\S*', 
                                            ' ', 
                                            regex=True, 
                                            case=False)
ap['title'] = ap['title'].str.replace(r'\S*\.(?:com|net|org)\b\S*', 
                                      ' ', 
                                      regex=True, 
                                      case=False)

The links have been removed:

In [37]:
ap[ap['selftext'].str.contains('http://', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


In [38]:
ap[ap['selftext'].str.contains('www.', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
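
The link-stripping idea can be tried on a single string with the standard-library `re` module, which stands in here for the pandas `.str.replace(..., regex=True, case=False)` calls. This is a sketch on a made-up sentence: `strip_links` is a hypothetical helper, and `DOMAIN_TOKEN` is a token-based pattern chosen for this sketch.

```python
import re

# Anything starting with 'http' or 'www.' up to the next whitespace.
URL_PREFIX = re.compile(r'(http|www[.])[\S]*', flags=re.IGNORECASE)

# A single whitespace-delimited token that contains .com, .net or .org.
DOMAIN_TOKEN = re.compile(r'\S*\.(?:com|net|org)\b\S*', flags=re.IGNORECASE)

def strip_links(text):
    """Replace link-like tokens with a space."""
    text = URL_PREFIX.sub(' ', text)
    return DOMAIN_TOKEN.sub(' ', text)

sample = 'See https://example.com/a and WWW.example.org or example.net/page for details'
cleaned = strip_links(sample)
print(cleaned.split())
```

Only the link-like tokens are replaced; the surrounding words are left in place.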


### Removing new line characters

Some new line characters were spotted:

In [39]:
ap[ap['selftext'].str.contains('\n', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
3,Gingerstrands,,1635283070,"I went to the Apple Store today to look at the pros in person. To be honest, I’m even more uncer...",apple,"To anyone like me who was wondering: yes, you can fit two apps side by side on the 14” comfortab..."
15,SnooJokes102,,1635269119,I bought some airpods pro about a year bought the full cover for it including accidents however...,apple,Question: is it better to ship airpods to apple or take them to the apple store?
54,Furstman,,1635203770,I did download some Apple Desktop Background pictures and I saw an iCloud Icon and download it b...,apple,I installed Monterey and I can't delete the Apple Background Pictures.
55,binarysmurf,,1635201204,"i have an iMac 13,2 which EOL'ed at Catalina. I used OpenCore Legacy Patcher to install Big Sur,...",apple,OpenCore Legacy Patcher Big Sur -&gt; OpenCore Legacy Patcher Monterey?
59,KatnUSSblack,,1635197242,"What are your likes, dislikes, things you wanna try? \n\nHow is it going to enhance your product...",apple,iOS 15.1 - how are we feeling about it??


They will be removed from the `title` and `selftext` and replaced with spaces:

In [40]:
ap['selftext'] = ap['selftext'].str.replace('\n', ' ', regex=False)
ap['title'] = ap['title'].str.replace('\n', ' ', regex=False)

The characters have been removed.

In [41]:
ap[ap['selftext'].str.contains('\n', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


In [42]:
ap[ap['title'].str.contains('\n', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


### Removing '[View Poll]('

Some instances of '[View Poll](' were spotted. These are what remains of poll links once the URLs were stripped in the link-removal step:

In [43]:
ap[ap['selftext'].str.contains('[View Poll](', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
559,clemenslucas,,1634472586,[View Poll](,apple,"Will a new MacBook Pro be announced at Apple's ""Unleashed"" Event?"
560,clemenslucas,,1634472362,[View Poll](,apple,"Will new AirPods be announced at Apple's ""Unleashed"" Event?"
562,clemenslucas,,1634472231,[View Poll](,apple,"What greeting will Tim Cook open Apple's ""Unleashed"" Keynote with?"
563,clemenslucas,,1634471943,[View Poll](,apple,"Will a new Mac mini be announced at Apple's ""Unleashed"" Event?"
564,clemenslucas,,1634471561,[View Poll](,apple,"What greeting will Tim Cook open Apple's ""Unleashed"" Keynote with?"


They will be removed from the `title` and `selftext` and replaced with spaces:

In [44]:
ap['selftext'] = ap['selftext'].str.replace('[View Poll](', ' ', regex=False)
ap['title'] = ap['title'].str.replace('[View Poll](', ' ', regex=False)

The characters have been removed.

In [45]:
ap[ap['selftext'].str.contains('[View Poll](', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


In [46]:
ap[ap['title'].str.contains('[View Poll](', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


### Removing duplicates between `selftext` and `title` 

Some rows have the same `selftext` and `title`.

In [47]:
len(ap[ap['selftext'] == ap['title']])

18

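The `selftext` will be replaced with an empty string in those cases, so that the same text is not counted twice when the title and post body are combined.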

In [48]:
ap.loc[(ap['selftext'] == ap['title']), 'selftext'] = ''

There are no more rows with the same `selftext` and `title`.

In [49]:
len(ap[ap['selftext'] == ap['title']])

0

### Dropping the `author_flair_css_class` column 

We do not need the `author_flair_css_class` column any more. Thus, we will drop it. 

In [50]:
ap.drop(columns=['author_flair_css_class'], inplace=True)

The column has been dropped. 

In [51]:
ap.head(1)

Unnamed: 0,author,created_utc,selftext,subreddit,title
3,Gingerstrands,1635283070,"I went to the Apple Store today to look at the pros in person. To be honest, I’m even more uncer...",apple,"To anyone like me who was wondering: yes, you can fit two apps side by side on the 14” comfortab..."


### Previewing the results

In [52]:
ap.head(50)

Unnamed: 0,author,created_utc,selftext,subreddit,title
3,Gingerstrands,1635283070,"I went to the Apple Store today to look at the pros in person. To be honest, I’m even more uncer...",apple,"To anyone like me who was wondering: yes, you can fit two apps side by side on the 14” comfortab..."
4,BiovelaBoomer,1635282964,"What is the correct way to use the battery pack, would it be plugging in the cable into the ipho...",apple,Correct way to use battery pack?
10,gambler__,1635278050,"As the title says, I am curious why Siri can't recognize the proper pronunciation of hard to pro...",apple,Will Siri ever be able to recognize the proper pronunciation of hard-to-pronounce names?
15,SnooJokes102,1635269119,I bought some airpods pro about a year bought the full cover for it including accidents however...,apple,Question: is it better to ship airpods to apple or take them to the apple store?
21,DentistSea2573,1635257400,"I’m a student and I want to get an iPad for note taking, and maybe for drawing also. But I don’t...",apple,Should I get the 2021 12.9” iPad Pro or wait till next year?
49,Expensive_Age3018,1635209054,So I block someone on my iphone but suddenly I get a text from them on iMessage on my Desktop. W...,apple,iMessage
50,candlesVI,1635207133,So the iphone 11 on the apple store says Nov 16 - Nov 23. Does that mean it will arrive at my do...,apple,Iphone 11
53,CenZen,1635204014,"It’s like the title said, I just received confirmation that I’m getting a job and that I will be...",apple,I just got a job as a tier 1 tech support agent. I have questions about the iMac they’re sending...
54,Furstman,1635203770,I did download some Apple Desktop Background pictures and I saw an iCloud Icon and download it b...,apple,I installed Monterey and I can't delete the Apple Background Pictures.
55,binarysmurf,1635201204,"i have an iMac 13,2 which EOL'ed at Catalina. I used OpenCore Legacy Patcher to install Big Sur,...",apple,OpenCore Legacy Patcher Big Sur -&gt; OpenCore Legacy Patcher Monterey?


The number of rows and columns in the data is as follows:

In [53]:
ap.shape

(7208, 5)
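
The filtering steps above can be tallied against the counts reported in the cell outputs. A short arithmetic check (the step labels are descriptive names chosen for this sketch):

```python
# Row counts reported by the cells above, in order.
start_rows = 39850
dropped = {
    'null selftext': 30963,
    'duplicate title and selftext': 427,  # 8887 - 8460
    "'[deleted]' selftext": 669,
    'AutoModerator posts': 457,
    'exjr_ posts': 77,
    'moderator flair': 49,
}

remaining = start_rows
for step, n in dropped.items():
    remaining -= n
    print(f'{step:<30} -{n:>6} -> {remaining}')
```

The final figure matches the 7208 rows shown above.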

## Saving the DataFrame to CSV

In [54]:
ap.to_csv('../data/ap.csv', index=False)
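
Since the Samsung data goes through the same steps, the row-level filters above could be collected into one helper. This is a sketch only: `filter_posts` and its `extra_authors` parameter are hypothetical names, it covers the row filters but not the text clean-up (links, new lines, poll remnants), and the demo rows are made up.

```python
import numpy as np
import pandas as pd

def filter_posts(df, extra_authors=()):
    """Apply the row-level filters from this section and return a new DataFrame."""
    out = df.dropna(subset=['selftext'])
    out = out.drop_duplicates(subset=['title', 'selftext'])
    out = out[~out['selftext'].isin(['[removed]', '[deleted]', ''])]
    out = out[~out['author'].isin(['AutoModerator', *extra_authors])]
    out = out[out['author_flair_css_class'] != 'moderator']
    return out.drop(columns=['author_flair_css_class'])

# Made-up rows covering each filter: null body, duplicate, deleted, bot, moderator.
demo = pd.DataFrame({
    'author': ['a', 'b', 'b', 'AutoModerator', 'c', 'mod', 'exjr_'],
    'title': ['t1', 't2', 't2', 'Daily Advice Thread', 't3', 'Megathread', 'Update'],
    'selftext': [np.nan, 'body', 'body', 'Welcome', '[deleted]', 'Hello', 'News'],
    'author_flair_css_class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'moderator', np.nan],
})

filtered = filter_posts(demo, extra_authors=['exjr_'])
print(filtered)
```

Returning a new DataFrame instead of using `inplace=True` leaves the loaded data untouched, which makes it easier to re-run individual steps.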

# Data Filtering (Samsung)

## Loading the data 

The dataset from the Samsung subreddit will be loaded. We will only select certain columns.

In [55]:
ss = pd.read_csv('../data/df_samsung.csv', usecols=['subreddit',
                                                    'author',
                                                    'created_utc',
                                                    'selftext',
                                                    'title',
                                                    'author_flair_css_class'])

In [56]:
ss.head(5)

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
0,llThat1Guyll,,1635291756,I think it mightve Changed with the latest update. I swear it looks like longer or something. It...,samsung,"Did the Devil emojii change for Android? At least Samsung, because I swear it Does, but I can't ..."
1,Gold_Enigma,,1635289011,I bought a pair of galaxy buds+ a little under a year ago in the US and thought they worked very...,samsung,Samsung is really pissing me off today?
2,noobmistermuffin,,1635287113,I work in a phone repair store I don't speak for the brand or any company I speak with what I've...,samsung,DONT UPDATE YOUR A21 PHONE
3,UnusualMedico,,1635287012,,samsung,Help with trade in
4,Jorycle,,1635285986,Rant mode.\n\nSo we bought some appliances like 6 months back. It took them so long to come in s...,samsung,"Samsung delivery policies: Stupid, or the stupidest?"


## Filtering the data 

### Removing rows with null values in the `selftext` column

In [57]:
ss['selftext'].isna().value_counts()

False    7383
True     2612
Name: selftext, dtype: int64

Of the 9,995 rows, 2,612 have a null value in the `selftext` column. We will drop these rows:

In [58]:
ss.dropna(subset=['selftext'], inplace=True)

There are no more null values in the `selftext` column:

In [59]:
ss['selftext'].isna().value_counts()

False    7383
Name: selftext, dtype: int64

### Removing rows with null values in the `title` column

There are no rows with null values in the `title` column.

In [60]:
ss['title'].isna().value_counts()

False    7383
Name: title, dtype: int64

### Dropping duplicates (according to `title` and `selftext`)

In [61]:
ss.shape

(7383, 6)

There are 7383 rows in the DataFrame.

In [62]:
ss.drop_duplicates(subset=['title', 'selftext'], inplace=True)

Dropping duplicates removed 176 rows (7383 - 7207), leaving 7207:

In [63]:
ss.shape

(7207, 6)
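As a small illustration of what `drop_duplicates(subset=[...])` does, the sketch below uses a made-up three-row frame (the rows are hypothetical, not taken from the dataset). With the default `keep='first'`, the earliest row of each duplicated `title`/`selftext` pair is retained, and `duplicated` can be used beforehand to count how many rows will be dropped.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share the same title and selftext.
toy = pd.DataFrame({
    'author': ['a', 'b', 'c'],
    'title': ['Help with trade in', 'S21 keyboard issue', 'Help with trade in'],
    'selftext': ['How long does it take?', 'Keys lag.', 'How long does it take?'],
})

# Count the rows that repeat an earlier title/selftext pair.
n_dupes = toy.duplicated(subset=['title', 'selftext']).sum()

# Keep the first occurrence of each pair (keep='first' is the default).
deduped = toy.drop_duplicates(subset=['title', 'selftext'])

print(n_dupes)                      # number of rows that will be dropped
print(deduped['author'].tolist())   # authors of the rows that remain
```

Rows that share a title but differ in body (or vice versa) are not treated as duplicates, because both columns are listed in `subset`.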

### Removing posts with [removed] as the `selftext`

There are no posts with '[removed]' as the `selftext`, as they were already filtered out at the data-fetching stage.

In [64]:
(ss['selftext'] == '[removed]').value_counts()

False    7207
Name: selftext, dtype: int64

### Removing posts with [deleted] as the `selftext`

There are 110 posts with '[deleted]' as the `selftext`:

In [65]:
(ss['selftext'] == '[deleted]').value_counts()

False    7097
True      110
Name: selftext, dtype: int64

Removing the rows:

In [66]:
ss = ss[ss['selftext'] != '[deleted]']

There are no more rows with '[deleted]' as the `selftext`.

In [67]:
(ss['selftext'] == '[deleted]').value_counts()

False    7097
Name: selftext, dtype: int64

### Removing completely empty posts

There are no posts that have an empty string as the `selftext`:

In [68]:
(ss['selftext'] == '').value_counts()

False    7097
Name: selftext, dtype: int64

### Removing AutoModerator posts

There are 6 posts by 'AutoModerator':

In [69]:
(ss['author'] == 'AutoModerator').value_counts()

False    7091
True        6
Name: author, dtype: int64

The first five are shown below:

In [70]:
ss[ss['author'] == 'AutoModerator'].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
32,AutoModerator,,1635242418,Welcome to the Daily Support thread for [r/Samsung](https://www.reddit.com/r/samsung/). You can ...,samsung,Daily Support Thread
2508,AutoModerator,,1625445012,"# Join our discord for awesome tech support, great chat, and future giveaways! [https://discord....",samsung,*** HEY YOU! YA YOU! ***
2547,AutoModerator,,1625392814,Welcome to the Daily (Tech) Support thread for [r/Samsung](https://www.reddit.com/r/samsung/). Y...,samsung,Daily Tech Support Thread
4226,AutoModerator,,1622839994,"# If you have ANY questions, please join our official discord and we'll be happy to answer them...",samsung,We're looking for new moderators!
4229,AutoModerator,,1622838616,"# If you have ANY questions, please join our official discord and we'll be happy to answer them....",samsung,We're looking for new moderators!


These are automated moderator posts (daily support threads and announcements) rather than user discussion, so they will be filtered out:

In [71]:
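# .copy() gives an independent DataFrame, avoiding a SettingWithCopyWarning when columns are reassigned later
ss = ss[ss['author'] != 'AutoModerator'].copy()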

There are no more posts by 'AutoModerator'.

In [72]:
(ss['author'] == 'AutoModerator').value_counts()

False    7091
Name: author, dtype: int64

### Removing moderator posts 

The `author_flair_css_class` values show no flair category for moderators, unlike in the Apple data, so there is nothing to remove at this step.

In [73]:
ss['author_flair_css_class'].value_counts(dropna=False)

NaN             6548
custom           108
Note 10 Plus      47
S20 Ultra         41
S20 Plus          39
Note9             37
S10 Plus          35
S20               34
S9 Plus           26
S10 e             25
S10               20
Fold              19
S8                16
Note8             11
S9                11
iphone            10
S1                 8
S8 Plus            8
a30                7
S2                 7
s21 series         7
S10 5G             7
S7                 5
a20                4
Note 10            4
Note3              2
Note2              2
J7 2015            1
S6                 1
Note5              1
Name: author_flair_css_class, dtype: int64

### Removing links

There are links in the data:

In [74]:
ss[ss['selftext'].str.contains('http://', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
270,justarandomaccname,,1634372088,So today I went to the Samsung US site to check if there were any new offers for the Galaxy Tab ...,samsung,Samsung Galaxy Unpacked Part 2?
525,yashwanthms,,1633453689,"I'm not able to change any notification settings for WhatsApp messages, since they're not being ...",samsung,"Whatsapp message notifications being categorised as ""Silent Notifications"" by default."
1160,AussieP1E,,1631482362,Anyone else's Galaxy watch 4 shedding on the watch band?\nSeems like the coating on top is comin...,samsung,Galaxy Watch4 band shedding?
1271,kb389,Fold,1631130738,"How do i turn it off?\n\nhttp://imgur.com/gallery/kg2aHhu\n\nVery annoying, I somehow enabled it...",samsung,Game booster priority mode?
1459,ronjon2018,,1630424907,Since I got my Z Fold3 (from Note20 Ultra) I decided to start using One UI cuz Nova doesn't rlly...,samsung,Play store icons in One UI!


In [75]:
ss[ss['selftext'].str.contains('www.', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
8,Shadeslayur,custom,1635281486,"I've had a S21 Ultra since launch. I've tried a number of screen protectors for this device, not...",samsung,The Whitestone Dome screen protector for the S21 Ultra is hands down the best option for the dev...
9,Jackasaurusrex31,,1635281353,"I've had a S21 Ultra since launch. I've tried a number of screen protectors for this device, not...",samsung,The Whitestone Dome Glass screen protector for the Galaxy S21 Ultra is hands down the best scree...
53,Trimm0311,,1635181806,Is better Iphone 8 plus or Samsung S10e or other phone 200-250€ also refurbished. I need more fo...,samsung,which better
81,syresynth,custom,1635085526,My previous [post](https://www.reddit.com/r/GalaxyFold/comments/qemr52/lcd_very_bright_glitching...,samsung,Screen of Z fold 2 very bright glitching (in video) initially . 1 year and 2 weeks of ownership....
88,UMZ747,,1635062130,I have heard about a lot of issues about s20 series regarding camera and screen reliability. Wha...,samsung,S10 plus used or S20/20+ used?


Links in the title and selftext will be removed using the following code:

In [76]:
# Match links that start with http, https or www. (case-insensitive) and
# replace each with a space. Links without one of these prefixes are not matched here.
ss['selftext'] = ss['selftext'].str.replace(r'(?:http|www[.])\S*',
                                            ' ',
                                            regex=True,
                                            case=False)
ss['title'] = ss['title'].str.replace(r'(?:http|www[.])\S*',
                                      ' ',
                                      regex=True,
                                      case=False)

# Match bare domains such as example.com, example.net or example.org
# (case-insensitive) and replace each with a space. Only these 3 common
# domain names are targeted, and the match is confined to the single
# whitespace-delimited token that contains the domain.
ss['selftext'] = ss['selftext'].str.replace(r'\S*\.(?:com|net|org)\S*',
                                            ' ',
                                            regex=True,
                                            case=False)
ss['title'] = ss['title'].str.replace(r'\S*\.(?:com|net|org)\S*',
                                      ' ',
                                      regex=True,
                                      case=False)

The links have been removed:

In [77]:
ss[ss['selftext'].str.contains('http://', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


In [78]:
ss[ss['selftext'].str.contains('www.', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
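Link-stripping patterns of this kind are easy to sanity-check on a single string before applying them to a whole column. The sketch below uses only Python's `re` module and a made-up sentence; the names `PREFIXED_LINK`, `BARE_DOMAIN` and `strip_links` are hypothetical. The bare-domain pattern keeps `.com`/`.net`/`.org` inside one whitespace-delimited token, so other words on the same line are left alone.

```python
import re

# Links that start with http, https or www. (case-insensitive).
PREFIXED_LINK = re.compile(r'(?:http|www[.])\S*', flags=re.IGNORECASE)

# Bare domains such as example.com; the match stays inside one
# whitespace-delimited token.
BARE_DOMAIN = re.compile(r'\S*\.(?:com|net|org)\S*', flags=re.IGNORECASE)


def strip_links(text):
    """Replace prefixed links and bare .com/.net/.org domains with a space."""
    text = PREFIXED_LINK.sub(' ', text)
    return BARE_DOMAIN.sub(' ', text)


# Made-up sample text, not taken from the dataset.
sample = 'See http://imgur.com/gallery/abc or Example.org for the S21 guide'
cleaned = strip_links(sample)

print(cleaned.split())
# → ['See', 'or', 'for', 'the', 'S21', 'guide']
```

Checking the surviving words, rather than only the absence of `http://`, also confirms that the patterns do not remove ordinary text.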


### Removing newline characters

Some rows contain newline characters (`\n`):

In [79]:
ss[ss['selftext'].str.contains('\n', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
1,Gold_Enigma,,1635289011,I bought a pair of galaxy buds+ a little under a year ago in the US and thought they worked very...,samsung,Samsung is really pissing me off today?
2,noobmistermuffin,,1635287113,I work in a phone repair store I don't speak for the brand or any company I speak with what I've...,samsung,DONT UPDATE YOUR A21 PHONE
4,Jorycle,,1635285986,Rant mode.\n\nSo we bought some appliances like 6 months back. It took them so long to come in s...,samsung,"Samsung delivery policies: Stupid, or the stupidest?"
7,ryang4415,,1635283668,"AT&amp;T has a promotion right now where if you buy the Z Flip3 on an installment, you get the b...",samsung,"Is the Z Flip3 worth it? If you have one, do you regret it?"
8,Shadeslayur,custom,1635281486,"I've had a S21 Ultra since launch. I've tried a number of screen protectors for this device, not...",samsung,The Whitestone Dome screen protector for the S21 Ultra is hands down the best option for the dev...


They will be replaced with spaces in both the `title` and `selftext`:

In [80]:
ss['selftext'] = ss['selftext'].str.replace('\n', ' ', regex=False)
ss['title'] = ss['title'].str.replace('\n', ' ', regex=False)

The newline characters have been removed from both columns.

In [81]:
ss[ss['selftext'].str.contains('\n', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


In [82]:
ss[ss['title'].str.contains('\n', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


### Removing '[View Poll]('

Some instances of '[View Poll](' were spotted:

In [83]:
ss[ss['selftext'].str.contains('[View Poll](', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
53,Trimm0311,,1635181806,Is better Iphone 8 plus or Samsung S10e or other phone 200-250€ also refurbished. I need more fo...,samsung,which better
88,UMZ747,,1635062130,I have heard about a lot of issues about s20 series regarding camera and screen reliability. Wha...,samsung,S10 plus used or S20/20+ used?
164,ScholarImaginary9280,,1634788541,Should I upgrade to Samsung Galaxy zflip 5G or wait for samsung 2022 flagship phones? I have a s...,samsung,Zflip 3 5G or 2022 samsung 2022 flagship phones.
224,Sovereign108,,1634560762,Just curious what the trend is :) also needing to decide somewhat soon. Jus comparing two manufa...,samsung,Where is the hype trend? Galaxy S22 or Pixel 6
241,GamerBeast954,s21 series,1634467949,Was it a good upgrade or you wished you didn’t upgrade? [View Poll](,samsung,Which Samsung phone did you purchased this year?


They will be replaced with spaces in both the `title` and `selftext`:

In [84]:
ss['selftext'] = ss['selftext'].str.replace('[View Poll](', ' ', regex=False)
ss['title'] = ss['title'].str.replace('[View Poll](', ' ', regex=False)

The '[View Poll](' markers have been removed from both columns.

In [85]:
ss[ss['selftext'].str.contains('[View Poll](', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title


In [86]:
ss[ss['title'].str.contains('[View Poll](', regex=False)].head()

Unnamed: 0,author,author_flair_css_class,created_utc,selftext,subreddit,title
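The `regex=False` argument matters for this marker: read as a regular expression, '[View Poll](' is malformed. The sketch below shows this with the standard `re` module on a made-up string, and uses `re.escape` as one way to embed the marker safely if a regex is ever needed.

```python
import re

marker = '[View Poll]('

# Interpreted as a regex, the marker is malformed: '[View Poll]' is a
# character class and '(' opens a group that is never closed.
try:
    re.compile(marker)
    compiles = True
except re.error:
    compiles = False

# re.escape makes the marker safe to embed in a larger pattern.
escaped = re.compile(re.escape(marker))

# Made-up sample text ending with the poll marker.
sample = 'Was it a good upgrade? [View Poll]('
cleaned = escaped.sub(' ', sample)

print(compiles)        # → False
print(repr(cleaned))   # the marker is replaced by a space
```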


### Removing duplicates between `selftext` and `title` 

Some rows have the same `selftext` and `title`.

In [87]:
len(ss[ss['selftext'] == ss['title']])

30

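In those cases the `selftext` will be set to an empty string, so that the same text is not counted twice when the title and post body are combined.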

In [88]:
ss.loc[(ss['selftext'] == ss['title']), 'selftext'] = ''

There are no more rows with the same `selftext` and `title`.

In [89]:
len(ss[ss['selftext'] == ss['title']])

0
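Blanking the duplicated `selftext` matters because the title and post body are later combined into one text field (as described in the Executive Summary); without it, those posts would contribute every word twice. A minimal sketch on made-up rows, where the combined column name `text` and the joining method are assumptions:

```python
import pandas as pd

# Hypothetical toy data: the first row's body merely repeats its title.
toy = pd.DataFrame({
    'title': ['Help with trade in', 'S21 keyboard issue'],
    'selftext': ['Help with trade in', 'Keys lag in Teams.'],
})

# Blank the body wherever it repeats the title.
toy.loc[toy['selftext'] == toy['title'], 'selftext'] = ''

# Combine title and body; strip() removes the trailing space left by an empty body.
toy['text'] = (toy['title'] + ' ' + toy['selftext']).str.strip()

print(toy['text'].tolist())
# → ['Help with trade in', 'S21 keyboard issue Keys lag in Teams.']
```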

### Dropping the `author_flair_css_class` column 

The `author_flair_css_class` column is no longer needed, so we will drop it.

In [90]:
ss.drop(columns=['author_flair_css_class'], inplace=True)

The column has been dropped. 

In [91]:
ss.head(1)

Unnamed: 0,author,created_utc,selftext,subreddit,title
0,llThat1Guyll,1635291756,I think it mightve Changed with the latest update. I swear it looks like longer or something. It...,samsung,"Did the Devil emojii change for Android? At least Samsung, because I swear it Does, but I can't ..."


### Previewing the results

In [92]:
ss.head(50)

Unnamed: 0,author,created_utc,selftext,subreddit,title
0,llThat1Guyll,1635291756,I think it mightve Changed with the latest update. I swear it looks like longer or something. It...,samsung,"Did the Devil emojii change for Android? At least Samsung, because I swear it Does, but I can't ..."
1,Gold_Enigma,1635289011,I bought a pair of galaxy buds+ a little under a year ago in the US and thought they worked very...,samsung,Samsung is really pissing me off today?
2,noobmistermuffin,1635287113,I work in a phone repair store I don't speak for the brand or any company I speak with what I've...,samsung,DONT UPDATE YOUR A21 PHONE
4,Jorycle,1635285986,Rant mode. So we bought some appliances like 6 months back. It took them so long to come in sto...,samsung,"Samsung delivery policies: Stupid, or the stupidest?"
6,ryang4415,1635284195,I can probably upgrade to a better Samsung up to $1000 for free. What should I get? It's been a ...,samsung,What phone should I upgrade to?
7,ryang4415,1635283668,"AT&amp;T has a promotion right now where if you buy the Z Flip3 on an installment, you get the b...",samsung,"Is the Z Flip3 worth it? If you have one, do you regret it?"
8,Shadeslayur,1635281486,"I've had a S21 Ultra since launch. I've tried a number of screen protectors for this device, not...",samsung,The Whitestone Dome screen protector for the S21 Ultra is hands down the best option for the dev...
9,Jackasaurusrex31,1635281353,"I've had a S21 Ultra since launch. I've tried a number of screen protectors for this device, not...",samsung,The Whitestone Dome Glass screen protector for the Galaxy S21 Ultra is hands down the best scree...
10,jking1676,1635280239,"When using certain apps (Amazon, Teams, Instagram), pretty much anything with messaging capabili...",samsung,S21 keyboard issue
12,rohitvarma1986,1635276551,Basically the title . Currently the only way to add apps from finder search to home screen is vi...,samsung,Any way to add app directly from drawer search to home screen.


The number of rows and columns in the data is as follows:

In [93]:
ss.shape

(7091, 5)
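Because the same row-level filters are applied to both the Apple and Samsung data, they could be wrapped in one helper. The function name `filter_posts` and the toy rows below are hypothetical; this is a sketch of the steps above under those assumptions, not the code the notebook ran.

```python
import pandas as pd


def filter_posts(df):
    """Apply the row-level filters from this section and return a new DataFrame."""
    out = df.dropna(subset=['selftext'])                       # null bodies
    out = out.drop_duplicates(subset=['title', 'selftext'])    # repeated posts
    out = out[~out['selftext'].isin(['[removed]', '[deleted]', ''])]  # placeholders
    out = out[out['author'] != 'AutoModerator']                # automated posts
    return out.copy()


# Hypothetical toy data exercising each filter once.
toy = pd.DataFrame({
    'author': ['u1', 'u2', 'u1', 'u3', 'AutoModerator'],
    'title': ['A', 'B', 'A', 'C', 'Daily Support Thread'],
    'selftext': ['body a', None, 'body a', '[deleted]', 'Welcome'],
})

filtered = filter_posts(toy)
print(filtered.shape)   # only the first row survives all four filters
```

Returning a copy keeps later column assignments (such as link removal) from touching the original frame.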

## Saving the DataFrame to CSV

In [94]:
ss.to_csv('../data/ss.csv', index=False)
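One thing worth keeping in mind when this file is loaded again: by default `pd.read_csv` reads empty fields back as `NaN`, so the posts whose `selftext` was set to an empty string would reappear as nulls. This is general pandas behaviour rather than something shown in this notebook. The sketch below reproduces it on a hypothetical two-row frame using an in-memory buffer, and shows two possible ways to handle it.

```python
import io

import pandas as pd

# Hypothetical toy data: the second row has an empty body, like the posts
# whose selftext duplicated the title.
toy = pd.DataFrame({
    'title': ['S21 keyboard issue', 'Help with trade in'],
    'selftext': ['Keys lag in Teams.', ''],
})

# Write to an in-memory string instead of a file on disk.
csv_text = toy.to_csv(index=False)

# Default behaviour: the empty field is read back as NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# Option 1: fill the missing bodies after loading.
filled = default_read.fillna({'selftext': ''})

# Option 2: disable the default NaN markers so empty fields stay as ''.
as_strings = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read['selftext'].isna().sum())   # nulls after a default read
print(filled['selftext'].tolist())
print(as_strings['selftext'].tolist())
```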

In the next notebook, the data will be explored.