<img src="https://nserc-hi-am.ca/2020/wp-content/uploads/sites/18/2019/12/McGill.png" width="500" height="400" align="left">

#### Date: February 5th, 2021

**Objective:**

In this assignment you are an analytics consultant to a (i) brand manager, (ii) product manager and (iii) advertising manager. Your job is to give advice/insights to these individuals based on the analysis of social media conversations. The detailed tasks are described below.

Develop a crawler/scraper to fetch messages posted in Edmunds.com discussion forums. The crawler output should be a .csv file with the following columns: date, userid, and message (even though you will only use the messages in your analysis).

Fetch around 5,000 posts about cars from a General topics forum. Do NOT choose a forum dedicated to a particular brand or model. Instead, you can choose the General & Sedans categories and then select, for example, the Entry Level Luxury forum (https://forums.edmunds.com/discussion/2864/general/x/entry-level-luxury-performance-sedans). The idea is to have multiple brands and models being discussed without one of them being the focal point.



In [17]:
#Libraries

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [18]:
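# Scrape the first page of the Edmunds.com discussion thread

url = 'https://forums.edmunds.com/discussion/7526/general/x/midsize-sedans-2-0'
r = requests.get(url, timeout=30)
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
soup = BeautifulSoup(r.text, 'html.parser')
# Inspect a single comment by its element id:
# comment_div = soup.find(id='Comment_3515400')
# print(comment_div.text)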
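
The cell above fetches only the first page of the thread, while the assignment asks for roughly 5,000 posts written to a .csv file with date, userid, and message columns. A minimal sketch of the paging and output steps follows. The `/p<N>` page suffix is an assumption about how the forum paginates and should be checked in a browser; `page_url` and `write_posts` are hypothetical helper names. Parsing each page into (date, userid, message) tuples is left out because it depends on the forum's HTML.

```python
import csv
import io


def page_url(base_url, page):
    """Return the URL of a thread page (ASSUMED pattern: page N lives at <base>/pN)."""
    return base_url if page == 1 else f"{base_url}/p{page}"


def write_posts(rows, fileobj):
    """Write (date, userid, message) tuples as CSV with the required header."""
    writer = csv.writer(fileobj)
    writer.writerow(["date", "userid", "message"])
    for date, userid, message in rows:
        # Strip stray newlines and collapse repeated whitespace left by the HTML.
        writer.writerow([date.strip(), userid.strip(), " ".join(message.split())])


# Usage with one post from the thread; a real run would build the rows by
# parsing the response for each page_url(...) before calling write_posts.
base = "https://forums.edmunds.com/discussion/7526/general/x/midsize-sedans-2-0"
sample_rows = [
    ("April 11, 2007 6:52PM", "\nmotownusa", "\nHi Pat:You forgot the Chrysler Sebring"),
]
buffer = io.StringIO()
write_posts(sample_rows, buffer)
csv_text = buffer.getvalue()
```

Writing to an open file instead of `io.StringIO` produces the .csv directly, and stripping the fields at write time avoids the leading `\n` that otherwise shows up in the User and Comment columns.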


In [3]:
models = pd.read_csv('models.csv')

In [4]:
models.head()

Unnamed: 0,acura,integra
0,acura,Legend
1,acura,vigor
2,acura,rlx
3,acura,ILX
4,acura,MDX


In [5]:
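# models.csv has no header row, so read_csv used its first record
# (acura, integra) as the column names; renaming the columns drops that record.
# Reading with header=None, names=['brand', 'model'] would keep it.
models.columns = ['brand', 'model']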

In [7]:
models

Unnamed: 0,brand,model
0,acura,Legend
1,acura,vigor
2,acura,rlx
3,acura,ILX
4,acura,MDX
...,...,...
522,volvo,xc90
523,volvo,s60
524,volvo,s80
525,volvo,v60


In [10]:
#root = pd.read_csv('root.csv')
extract = pd.read_csv('edmunds_extraction2.csv')

In [11]:
extract

Unnamed: 0,Counter,Date,User,Comment
0,1,"April 11, 2007 6:52PM",\nmotownusa,\nHi Pat:You forgot the Chrysler Sebring
1,2,"April 11, 2007 7:33PM",\nexshoman,\nI'm sure some folks would appreciate having ...
2,3,"April 12, 2007 6:51AM",\ntargettuning,\nYou can try to revive this topic but without...
3,4,"April 12, 2007 8:43AM",\npat,\nModel vs. model is exactly what we're here f...
4,5,"April 13, 2007 11:49AM",\nperna,\nThe Altima is my favorite of the bunch. It i...
...,...,...,...,...
4995,4996,"September 12, 2007 12:49PM",\nkdshapiro,"\n""Let me try one more time. Accord and Camry ..."
4996,4997,"September 12, 2007 12:55PM",\nbenderofbows,"\n""No one likes repairs on their car but I fai..."
4997,4998,"September 12, 2007 1:24PM",\njimlockey,\nToyota & GM gives you guys what they want yo...
4998,4999,"September 12, 2007 2:21PM",\ncaptain2,\nWhat weaknesses? you have goota be kidding ...


In [12]:
xl = extract['Comment'].str.lower()

In [13]:
extract['Comment'] = xl

In [None]:
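# This step takes time

nltk.download('stopwords')
nltk.download('punkt')  # tokenizer data needed by word_tokenize (newer NLTK releases use 'punkt_tab')
from nltk.tokenize import word_tokenize

# Build the stopword set once: calling stopwords.words() for every token
# reloads the full multi-language list each time and dominates the runtime.
stop_words = set(stopwords.words())

xl = xl.tolist()
for i in range(len(xl)):
    text_tk = word_tokenize(xl[i])
    s = [word for word in text_tk if word not in stop_words]
    xl[i] = ' '.join(s)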

In [None]:
extract = extract.drop(['Comment'], axis=1)

In [None]:
extract['Comment'] = xl

In [None]:
extract.to_csv('dataset.csv')

In [None]:
models

In [None]:
models['model'][0].lower()

In [None]:
low_mod = models['model'].str.lower()
low_br = models['brand'].str.lower()

In [None]:
nltk.download('wordnet')

In [None]:
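lemmatizer = WordNetLemmatizer()

# Map each lowercased model name to its brand(s). A model listed more than
# once in models.csv yields its brand once per listing, as in the original loop.
model_to_brands = {}
for model, brand in zip(low_mod, low_br):
    model_to_brands.setdefault(model, []).append(brand)

for j in range(len(xl)):
    s = []
    for word in word_tokenize(xl[j]):
        if word in model_to_brands:
            # Replace a model mention with its brand name
            s.extend(model_to_brands[word])
        else:
            s.append(lemmatizer.lemmatize(word))
    xl[j] = ' '.join(s)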

In [None]:
xl

In [25]:
extract

Unnamed: 0,Counter,Date,User,Comment
0,1,"April 11, 2007 6:52PM",\nmotownusa,hi pat : forgot chrysler sebring
1,2,"April 11, 2007 7:33PM",\nexshoman,'m sure folks would appreciate malibu included...
2,3,"April 12, 2007 6:51AM",\ntargettuning,try revive topic without able discuss ( howeve...
3,4,"April 12, 2007 8:43AM",\npat,model vs. model exactly 're ! manufacturer vs....
4,5,"April 13, 2007 11:49AM",\nperna,altima favorite bunch . amongst fastest best h...
...,...,...,...,...
4995,4996,"September 12, 2007 12:49PM",\nkdshapiro,'' let try time . accord camry dominant namepl...
4996,4997,"September 12, 2007 12:55PM",\nbenderofbows,'' likes repairs car fail see huge hassle unle...
4997,4998,"September 12, 2007 1:24PM",\njimlockey,"toyota & gm gives guys have.ford dead , n't kn..."
4998,4999,"September 12, 2007 2:21PM",\ncaptain2,weaknesses ? goota kidding - cars well others ...


In [26]:
extract = extract.drop(['Comment'], axis=1)
extract['Comment'] = xl
extract.to_csv('preprocessed_dataset.csv')

In [27]:
extract

Unnamed: 0,Counter,Date,User,Comment
0,1,"April 11, 2007 6:52PM",\nmotownusa,hi pat : forgot chrysler chrysler
1,2,"April 11, 2007 7:33PM",\nexshoman,'m sure folk would appreciate chevrolet chevro...
2,3,"April 12, 2007 6:51AM",\ntargettuning,try revive topic without able discus ( however...
3,4,"April 12, 2007 8:43AM",\npat,model vs. model exactly 're ! manufacturer vs....
4,5,"April 13, 2007 11:49AM",\nperna,nissan favorite bunch . amongst fastest best h...
...,...,...,...,...
4995,4996,"September 12, 2007 12:49PM",\nkdshapiro,'' let try time . honda toyota toyota dominant...
4996,4997,"September 12, 2007 12:55PM",\nbenderofbows,'' like repair car fail see huge hassle unless...
4997,4998,"September 12, 2007 1:24PM",\njimlockey,"toyota & gm give guy have.ford dead , n't know..."
4998,4999,"September 12, 2007 2:21PM",\ncaptain2,weakness ? goota kidding - car well others wea...


In [28]:
br = list(models['brand'].unique())

In [49]:
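# Count how many posts mention each brand at least once.
# str.find matches substrings, so a brand also matches inside longer words.
count = np.zeros(len(br))
for i in range(len(br)):
    for comment in xl:
        if comment.find(br[i]) > -1:
            count[i] += 1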
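
The `find` test in the loop above matches substrings, so a short brand name is also counted when it appears inside a longer word (for example `ford` inside `afford`). A minimal sketch of a token-based alternative, which counts each post at most once per brand, is shown here. It assumes whitespace tokenization is adequate for the preprocessed comments; `count_brand_posts` is a hypothetical helper, and the example posts are made up for illustration.

```python
from collections import Counter


def count_brand_posts(comments, brands):
    """Count, for each brand, the number of posts containing it as a whole token."""
    brand_set = set(brands)
    counts = Counter({brand: 0 for brand in brands})
    for comment in comments:
        # Using a set of tokens means repeated mentions in one post count once.
        for brand in brand_set & set(comment.split()):
            counts[brand] += 1
    return counts


# Made-up posts for illustration only.
posts = [
    "honda toyota toyota dominant nameplate",
    "can not afford new car",
    "ford dead",
]
result = count_brand_posts(posts, ["honda", "toyota", "ford"])

# For comparison: the substring test also counts the "afford" post for ford.
substring_ford = sum(1 for p in posts if p.find("ford") > -1)
```

`result.most_common()` lists the brands by descending frequency, much like the `sort_values` call used for the `frequencies` table. Tokens fused to punctuation, such as `have.ford` in the preprocessed comments, would be missed by this approach, so the choice between substring and token matching depends on how clean the tokenization is.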

In [50]:
frequencies = pd.DataFrame(br, columns =['brands']) 

In [51]:
frequencies['frequency'] = count

In [52]:
frequencies = frequencies.sort_values(by=['frequency'], ascending=False)

In [53]:
frequencies

Unnamed: 0,brands,frequency
5,car,2450.0
10,honda,2113.0
9,ford,1388.0
34,toyota,997.0
20,mazda,629.0
12,hyundai,606.0
24,nissan,580.0
27,problem,350.0
6,chevrolet,245.0
29,seat,243.0
