# Background



Apple was founded in 1976 by Steve Jobs, then a 21-year-old San Francisco-born hippie, together with Steve Wozniak, who built the company's first product, the ‘Apple I’. After returning to Apple in 1997, Jobs led the company through a period of massive growth, launching the iMac, Mac OS X, iPod, and iTunes. But it was the iPhone, launched on 29 June 2007, that took over the market. Apple has sold around 2.8 billion smartphones to date, including 242 million in 2022 alone.

Samsung was founded in 1938 by 28-year-old Lee Byung-Chull as a small trading company dealing in dried fish and noodles. After expanding into numerous industries, including textiles and financial services, Samsung found its golden goose in the electronics business in the late 1960s, and its semiconductor business grew immensely in the 1970s and 1980s. On 29 June 2009, two years after the iPhone launched, Samsung introduced the ‘Samsung Galaxy’, its first Android smartphone. Since then, Samsung has sold 2 billion smartphones, including 272 million in 2022 alone.

Source: https://hellostake.com/nz/blog/stake-updates/head-to-head-apple-vs-samsung

# Problem Statement

- An advertising agency wants to run a marketing campaign to increase mindshare for Samsung products, in particular smartphones, and in turn drive up sales. Samsung is one of the leaders in the worldwide smartphone market with a 23% share, but Apple is close behind.
- To prepare for the campaign, the agency wants to understand the brand's mindshare online, particularly in user-generated content (UGC) such as social media channels (e.g. Facebook, Instagram, Twitter) and review sites (e.g. TechRadar, trustedreviews.com). The marketing team was tasked to scrape UGC channels including Twitter, Facebook, Instagram, Amazon and Reddit to discover what consumers are saying about Samsung and to understand the word associations and topics of interest for Samsung consumers.
- The marketing team first took a laborious manual approach, copying and pasting content from each UGC channel and sorting it into buckets, i.e. Samsung and its competitors (Apple, Oppo, etc.).
- This took too much time, which would have been better spent on creating the most effective marketing mix for the campaign and talking to stakeholders, designers, etc.
- To resolve this, the marketing team asked the data science team for help, and the data science team decided to build a simple NLP model to save the marketing team time.
- To kick-start the work, they identified r/samsung (the Samsung subreddit) as the data source for building an NLP machine learning model that can later be applied to other UGC channels, with r/apple serving as the comparison.
- The goal is information retrieval on user-generated content across all UGC channels, particularly content about Samsung products. Task 1 is classification: scrape all comments, posts and reviews and place those related to Samsung in the Samsung bucket. Task 2 is to dig into the Samsung posts and reviews to understand the word associations and topics of interest for Samsung consumers, and to use these insights to create an effective marketing campaign.
- In summary, the machine learning model is tasked to correctly identify Samsung-related content and assign it the label "Samsung", which meets the goal of task 1. This would save the marketing team the laborious manual work of classifying content across all UGC channels.
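
To make task 1 concrete, here is a minimal sketch of a text classifier that assigns a post to "samsung" or "apple". It uses a small multinomial naive Bayes model written with the standard library only; this is an illustrative baseline, not necessarily the model the data science team built. The training sentences and the names `tokenize`, `train` and `predict` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(docs):
    """Count documents per label and words per label from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy posts standing in for scraped r/samsung and r/apple titles.
training_posts = [
    ("My Galaxy S23 battery drains fast", "samsung"),
    ("One UI update broke my Galaxy Watch", "samsung"),
    ("iPhone 14 camera is great", "apple"),
    ("MacBook and iPad sync over iCloud", "apple"),
]
model = train(training_posts)
print(predict(model, "Is the Galaxy battery good?"))
print(predict(model, "My new iPhone camera"))
```

In practice the same idea would be trained on the scraped post titles and bodies, with the subreddit name as the label.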

### Import Libraries


In [2]:
import pandas as pd
import praw

### Web scraping from r/apple and r/samsung

In [3]:
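import os

# Read the secrets from environment variables instead of hardcoding them in
# the notebook (the variable names below are assumptions).
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    password=os.environ["REDDIT_PASSWORD"],
    user_agent="project_webscrape",
    username="mumfordseas",
)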

In [4]:
#columns shared by all four dataframes
columns = ['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created']

def listing_to_df(listing):
    """Collect the fields of interest from each post in a PRAW listing."""
    posts = [[post.title, post.score, post.id, post.subreddit, post.url,
              post.num_comments, post.selftext, post.created] for post in listing]
    return pd.DataFrame(posts, columns=columns)

#dataframe hot and new apple posts
r_apple = reddit.subreddit('apple')
hotapple_posts = listing_to_df(r_apple.hot(limit=1000))
newapple_posts = listing_to_df(r_apple.new(limit=1000))

#dataframe hot and new samsung posts
r_samsung = reddit.subreddit('samsung')
hotsamsung_posts = listing_to_df(r_samsung.hot(limit=1000))
newsamsung_posts = listing_to_df(r_samsung.new(limit=1000))

In [5]:
#check shapes: each listing returns fewer than 1000 posts, so hot and new must be combined to exceed 1000 per subreddit
print(hotapple_posts.shape)
print(newapple_posts.shape)
print(hotsamsung_posts.shape)
print(newsamsung_posts.shape)

(707, 8)
(678, 8)
(433, 8)
(992, 8)


In [6]:
#combine hot and new apple posts
apple_results = pd.concat([hotapple_posts, newapple_posts], ignore_index=True)
apple_results.shape

(1385, 8)

In [7]:
#combine hot and new samsung posts
samsung_results = pd.concat([hotsamsung_posts, newsamsung_posts], ignore_index=True)
samsung_results.shape

(1425, 8)
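
A post can appear in both the hot and the new listing of a subreddit, so a combined frame may contain the same post twice. The following is a minimal sketch of dropping such repeats by post `id`, using hypothetical toy frames in place of the scraped data; whether the scraped frames actually contain duplicates is not shown here and would need to be checked.

```python
import pandas as pd

# Hypothetical toy frames standing in for a hot and a new listing.
hot = pd.DataFrame({"id": ["a1", "b2"], "title": ["Post A", "Post B"]})
new = pd.DataFrame({"id": ["b2", "c3"], "title": ["Post B", "Post C"]})

combined = pd.concat([hot, new], ignore_index=True)

# Keep the first occurrence of each post id and renumber the rows.
deduped = combined.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

print(combined.shape, deduped.shape)
```

Applied to the real data, comparing the shapes before and after shows how many posts were present in both listings.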

In [8]:
#save apple and samsung results into csv file
apple_results.to_csv(path_or_buf='../datasets/posts_apple.csv',index=False, header = True)
samsung_results.to_csv(path_or_buf='../datasets/posts_samsung.csv',index=False, header = True)

In [9]:
#combine all posts into one dataframe for modelling
results = pd.concat([hotapple_posts, newapple_posts, hotsamsung_posts, newsamsung_posts], ignore_index=True)
results.shape

(2810, 8)

In [10]:
#save the combined dataframe into a csv file
results.to_csv(path_or_buf='../datasets/posts_results.csv', index=False, header=True)
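
For modelling, the `subreddit` column of the saved file can serve as the class label. The sketch below shows one way to turn it into a binary target. It reads a hypothetical two-row CSV from memory rather than the real file, and it assumes the column holds the subreddit names `samsung` and `apple`; the column name `is_samsung` and the 1/0 encoding are illustrative choices.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for ../datasets/posts_results.csv.
csv_text = "title,subreddit\nGalaxy S23 review,samsung\niPhone 14 review,apple\n"
df = pd.read_csv(io.StringIO(csv_text))

# Assumed encoding: 1 = r/samsung (the class of interest), 0 = r/apple.
df["is_samsung"] = (df["subreddit"].str.lower() == "samsung").astype(int)

# Inspect the class balance before training a classifier.
print(df["is_samsung"].value_counts())
```

With the real data, the two classes are close in size (1,425 Samsung posts and 1,385 Apple posts before any cleaning).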