# Scraping Hello Peter

- In this notebook i will be scraping the Hello Peter website for reviews of a specific company.

Consider a company name on `Hello Peter`

- `Telkom`

Discovering from the website hellopeter.com it uses a microservice architecture. The reviews are stored in a database and the website is just a frontend to the database. There is a normal website hosting the frontend and a separate website hosting the API.

The API is hosted on `api.hellopeter.com` and the frontend is hosted on `hellopeter.com`

Therefore running web diagnostics on the API will give us the information we need.

I have reversed engineered the API. To get reviews for a specific company you need to make a `GET` request to the following endpoint:

`https://api.hellopeter.com/consumer/business/<companynamewithdash>/reviews?page=<pagenumber>`

Quite interesting to note that the company name is in the URL and not in the query parameters.

Anyway our life is easy now. We just need to make a `GET` request to the endpoint and we will get the reviews for the company.

- `Telkom`

```
https://api.hellopeter.com/consumer/business/telkom/reviews?page=1 to get the first page of reviews
...
...
https://api.hellopeter.com/consumer/business/telkom/reviews?page=n to get the nth page of reviews
```



#### Before we start
Basic imports

In [1]:
import pandas as pd
import numpy as np # I have found out that pandas sucks when working with large data sets apparently numpy is better when working with large data sets
# Assumption is that we might get comments 100k or more
import requests # for making http requests
import json # for parsing json data

### General Solution
I am making a function to get the reviews for a specific company. The function will take in the company name and the number of pages to scrape.

In [24]:
def GetReviews(companyname,pageno):
    # convert company name where spaces are replaced with hyphens
    df = pd.DataFrame()
    companyname=companyname.replace(" ","-")
    url = 'https://api.hellopeter.com/consumer/business/'+companyname+'/reviews?page='+str(1)
    page = requests.get(url)
    jsondata=json.loads(page.text)    
    last_page=jsondata['last_page']        
    df = df.from_records(jsondata['data'])
    i = 1

    while i <= last_page:
        if pageno == 0:
            print("Processing page: "+str(i)+ " of "+str(last_page))
            pass
        else:
            print("Processing page: "+str(i)+ " of "+str(pageno))
            if i == pageno:
                break
        
        i = i + 1
        url = 'https://api.hellopeter.com/consumer/business/'+companyname+'/reviews?page='+str(i)
        page = requests.get(url)
        jsondata=json.loads(page.text)
        df = pd.concat([df,df.from_records(jsondata['data'])])
        
    return df

# ill just do 100 pages for now
df = GetReviews("telkom",100)
df.head()

Processing page: 1 of 100
Processing page: 2 of 100
Processing page: 3 of 100
Processing page: 4 of 100
Processing page: 5 of 100
Processing page: 6 of 100
Processing page: 7 of 100
Processing page: 8 of 100
Processing page: 9 of 100
Processing page: 10 of 100
Processing page: 11 of 100
Processing page: 12 of 100
Processing page: 13 of 100
Processing page: 14 of 100
Processing page: 15 of 100
Processing page: 16 of 100
Processing page: 17 of 100
Processing page: 18 of 100
Processing page: 19 of 100
Processing page: 20 of 100
Processing page: 21 of 100
Processing page: 22 of 100
Processing page: 23 of 100
Processing page: 24 of 100
Processing page: 25 of 100
Processing page: 26 of 100
Processing page: 27 of 100
Processing page: 28 of 100
Processing page: 29 of 100
Processing page: 30 of 100
Processing page: 31 of 100
Processing page: 32 of 100
Processing page: 33 of 100
Processing page: 34 of 100
Processing page: 35 of 100
Processing page: 36 of 100
Processing page: 37 of 100
Processing

Unnamed: 0,id,user_id,created_at,authorDisplayName,author,authorAvatar,author_id,review_title,review_rating,review_content,...,industry_name,industry_slug,status_id,nps_rating,source,is_reported,business_reporting,author_created_date,author_total_reviews_count,attachments
0,4210659,3d503e80-740f-11ed-a6fb-ff77d0e4f2fe,2022-12-04 22:21:38,Lumka K,Lumka K,,3d503e80-740f-11ed-a6fb-ff77d0e4f2fe,Telkom scam,1,Telkom now scams it's customers. I have loaded...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2022-12-04,2,[]
1,4210541,3de70ab1-31fa-11e8-83f4-f23c91bb6188,2022-12-04 18:21:43,mapitso M,mapitso M,,3de70ab1-31fa-11e8-83f4-f23c91bb6188,Telkom worst service no value for once money c...,1,Worst service one can ever vouch for with your...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2018-03-27,8,[]
2,4210535,38d6e070-a691-11ec-a4a8-59cdfcc4b383,2022-12-04 18:05:52,KathleenMary B,KathleenMary B,,38d6e070-a691-11ec-a4a8-59cdfcc4b383,LTE sim only contract,1,I have a LTE sim only contract which I'd upgra...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2022-03-18,5,[]
3,4210402,455165c0-34db-11eb-90e0-67eed689742d,2022-12-04 14:10:46,Jason W,Jason W,,455165c0-34db-11eb-90e0-67eed689742d,Telkom,1,On the 4 Dec 2023 at 13h40 I phoned in in tryi...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2020-12-02,2,[]
4,4210377,05288e4c-31fa-11e8-83f4-f23c91bb6188,2022-12-04 13:08:14,lungile S,lungile S,,05288e4c-31fa-11e8-83f4-f23c91bb6188,Over billing,1,I took LTE sim with them but they keep chargin...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2011-07-07,14,[]


### Lets check shape of the data

In [26]:
# shape of the data frame
df.shape

(1100, 27)

### Saving data for future use

In [29]:
# make a directory to store the data
import os
try:
    os.mkdir("data")
except:
    pass

# save the data frame to a csv file
df.to_csv("data/telkom.csv",index=False)

### Evaluation of Business Reviews

In [38]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()

# iterate over the data frame 
for index, row in df.iterrows():
    currentReview = row['review_title']+'\n'+row['review_content']
    sentimentScore = sentiment.polarity_scores(currentReview)
    df.loc[index,'compound'] = sentimentScore['compound']
    df.loc[index,'neg'] = sentimentScore['neg']
    df.loc[index,'neu'] = sentimentScore['neu']
    df.loc[index,'pos'] = sentimentScore['pos']

for index,row in df.iterrows():
    if row['compound'] >= 0.05:
        df.loc[index,'sentiment'] = 'positive'
    elif row['compound'] <= -0.05:
        df.loc[index,'sentiment'] = 'negative'
    else:
        df.loc[index,'sentiment'] = 'neutral'

# count the number of positive, negative and neutral reviews
print(df['sentiment'].value_counts())

# print percentage of positive, negative and neutral reviews
print(df['sentiment'].value_counts(normalize=True))

negative    600
positive    500
Name: sentiment, dtype: int64
negative    0.545455
positive    0.454545
Name: sentiment, dtype: float64
