# Scraping Hello Peter

- In this notebook i will be scraping the Hello Peter website for reviews of a specific company.

Consider a company name on `Hello Peter`

- `Telkom`

Discovering from the website hellopeter.com it uses a microservice architecture. The reviews are stored in a database and the website is just a frontend to the database. There is a normal website hosting the frontend and a separate website hosting the API.

The API is hosted on `api.hellopeter.com` and the frontend is hosted on `hellopeter.com`

Therefore running web diagnostics on the API will give us the information we need.

I have reversed engineered the API. To get reviews for a specific company you need to make a `GET` request to the following endpoint:

`https://api.hellopeter.com/consumer/business/<companynamewithdash>/reviews?page=<pagenumber>`

Quite interesting to note that the company name is in the URL and not in the query parameters.

Anyway our life is easy now. We just need to make a `GET` request to the endpoint and we will get the reviews for the company.

- `Telkom`

```
https://api.hellopeter.com/consumer/business/telkom/reviews?page=1 to get the first page of reviews
...
...
https://api.hellopeter.com/consumer/business/telkom/reviews?page=n to get the nth page of reviews
```



#### Before we start
Basic imports

In [1]:
import pandas as pd
import numpy as np # I have found out that pandas sucks when working with large data sets apparently numpy is better when working with large data sets
# Assumption is that we might get comments 100k or more
import requests # for making http requests
import json # for parsing json data

### General Solution
I am making a function to get the reviews for a specific company. The function will take in the company name and the number of pages to scrape.

In [2]:
def GetReviews(companyname,pageno):
    # convert company name where spaces are replaced with hyphens
    df = pd.DataFrame()
    companyname=companyname.replace(" ","-")
    url = 'https://api.hellopeter.com/consumer/business/'+companyname+'/reviews?page='+str(1)
    page = requests.get(url)
    jsondata=json.loads(page.text)    
    last_page=jsondata['last_page']        
    df = df.from_records(jsondata['data'])
    i = 1

    while i <= last_page:
        if pageno == 0:
            print("Processing page: "+str(i)+ " of "+str(last_page))
            pass
        else:
            print("Processing page: "+str(i)+ " of "+str(pageno))
            if i == pageno:
                break
        
        i = i + 1
        url = 'https://api.hellopeter.com/consumer/business/'+companyname+'/reviews?page='+str(i)
        page = requests.get(url)
        jsondata=json.loads(page.text)
        df = pd.concat([df,df.from_records(jsondata['data'])])
        
    return df

# ill just do 100 pages for now
df = GetReviews("telkom",100)
df.head()

Processing page: 1 of 100
Processing page: 2 of 100
Processing page: 3 of 100
Processing page: 4 of 100
Processing page: 5 of 100
Processing page: 6 of 100
Processing page: 7 of 100
Processing page: 8 of 100
Processing page: 9 of 100
Processing page: 10 of 100
Processing page: 11 of 100
Processing page: 12 of 100
Processing page: 13 of 100
Processing page: 14 of 100
Processing page: 15 of 100
Processing page: 16 of 100
Processing page: 17 of 100
Processing page: 18 of 100
Processing page: 19 of 100
Processing page: 20 of 100
Processing page: 21 of 100
Processing page: 22 of 100
Processing page: 23 of 100
Processing page: 24 of 100
Processing page: 25 of 100
Processing page: 26 of 100
Processing page: 27 of 100
Processing page: 28 of 100
Processing page: 29 of 100
Processing page: 30 of 100
Processing page: 31 of 100
Processing page: 32 of 100
Processing page: 33 of 100
Processing page: 34 of 100
Processing page: 35 of 100
Processing page: 36 of 100
Processing page: 37 of 100
Processing

Unnamed: 0,id,user_id,created_at,authorDisplayName,author,authorAvatar,author_id,review_title,review_rating,review_content,...,industry_name,industry_slug,status_id,nps_rating,source,is_reported,business_reporting,author_created_date,author_total_reviews_count,attachments
0,4220709,189b4210-31fa-11e8-83f4-f23c91bb6188,2022-12-11 10:44:36,michael,michael,,189b4210-31fa-11e8-83f4-f23c91bb6188,Telkom eish,1,We upgraded our Telkom LTE to uncapped a few m...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2014-02-03,2,[]
1,4220657,2ab58790-5e1c-11ea-84a8-633da5152604,2022-12-11 08:21:07,Tintswalo M,Tintswalo M,,2ab58790-5e1c-11ea-84a8-633da5152604,Bad service,1,I use Adsl and I haven't had connectivity from...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2020-03-04,9,[]
2,4220589,30c06500-2db3-11ed-9a2e-37d9f06d41d3,2022-12-10 23:51:47,Lulama R,Lulama R,,30c06500-2db3-11ed-9a2e-37d9f06d41d3,Telkom (9570 Rand To Terminate A ConTract),1,They are Telling Me To Pay R9570 To Cancel A C...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2022-09-06,4,[]
3,4220221,fbc1cc90-f7d5-11e8-981e-d16280a3bb5a,2022-12-10 11:41:54,Moosa M,Moosa M,,fbc1cc90-f7d5-11e8-981e-d16280a3bb5a,Telkom sucks,1,Trying to resolve a billing issue with telkom...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2018-12-04,2,[]
4,4220027,f1989af0-247d-11ed-9b4a-437238555a96,2022-12-10 07:25:59,Johanna B,Johanna B,,f1989af0-247d-11ed-9b4a-437238555a96,BEST TELKOM EMPLOYEE,5,I am writing this as one very happy (now ex-) ...,...,Telecommunications,telecommunications,1,,WEBSITE,False,,2022-08-25,2,[]


### Lets check shape of the data

In [3]:
# shape of the data frame
df.shape

(1100, 27)

### Saving data for future use

In [4]:
# make a directory to store the data
import os
try:
    os.mkdir("data")
except:
    pass

# save the data frame to a csv file
df.to_csv("data/telkom.csv",index=False)

### Evaluation of Business Reviews

In [7]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()

# iterate over the data frame 
for index, row in df.iterrows():
    currentReview = row['review_title']+'\n'+row['review_content']
    sentimentScore = sentiment.polarity_scores(currentReview)
    df.loc[index,'compound'] = sentimentScore['compound']
    df.loc[index,'neg'] = sentimentScore['neg']
    df.loc[index,'neu'] = sentimentScore['neu']
    df.loc[index,'pos'] = sentimentScore['pos']

for index,row in df.iterrows():
    if row['compound'] >= 0.05:
        df.loc[index,'sentiment'] = 'positive'
    elif row['compound'] <= -0.05:
        df.loc[index,'sentiment'] = 'negative'
    else:
        df.loc[index,'sentiment'] = 'neutral'

# count the number of positive, negative and neutral reviews
print(df['sentiment'].value_counts())

# print percentage of positive, negative and neutral reviews
print(df['sentiment'].value_counts(normalize=True))

negative    1100
Name: sentiment, dtype: int64
negative    1.0
Name: sentiment, dtype: float64


Unnamed: 0,id,user_id,created_at,authorDisplayName,author,authorAvatar,author_id,review_title,review_rating,review_content,...,is_reported,business_reporting,author_created_date,author_total_reviews_count,attachments,compound,neg,neu,pos,sentiment
0,4220709,189b4210-31fa-11e8-83f4-f23c91bb6188,2022-12-11 10:44:36,michael,michael,,189b4210-31fa-11e8-83f4-f23c91bb6188,Telkom eish,1,We upgraded our Telkom LTE to uncapped a few m...,...,False,,2014-02-03,2,[],-0.6369,0.172,0.828,0.000,negative
1,4220657,2ab58790-5e1c-11ea-84a8-633da5152604,2022-12-11 08:21:07,Tintswalo M,Tintswalo M,,2ab58790-5e1c-11ea-84a8-633da5152604,Bad service,1,I use Adsl and I haven't had connectivity from...,...,False,,2020-03-04,9,[],-0.5223,0.153,0.719,0.128,negative
2,4220589,30c06500-2db3-11ed-9a2e-37d9f06d41d3,2022-12-10 23:51:47,Lulama R,Lulama R,,30c06500-2db3-11ed-9a2e-37d9f06d41d3,Telkom (9570 Rand To Terminate A ConTract),1,They are Telling Me To Pay R9570 To Cancel A C...,...,False,,2022-09-06,4,[],-0.2960,0.031,0.969,0.000,negative
3,4220221,fbc1cc90-f7d5-11e8-981e-d16280a3bb5a,2022-12-10 11:41:54,Moosa M,Moosa M,,fbc1cc90-f7d5-11e8-981e-d16280a3bb5a,Telkom sucks,1,Trying to resolve a billing issue with telkom...,...,False,,2018-12-04,2,[],-0.1646,0.038,0.934,0.028,negative
4,4220027,f1989af0-247d-11ed-9b4a-437238555a96,2022-12-10 07:25:59,Johanna B,Johanna B,,f1989af0-247d-11ed-9b4a-437238555a96,BEST TELKOM EMPLOYEE,5,I am writing this as one very happy (now ex-) ...,...,False,,2022-08-25,2,[],-0.9186,0.231,0.731,0.037,negative
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6,4107022,19deed74-31fa-11e8-83f4-f23c91bb6188,2022-09-26 12:32:20,Ivan,Ivan,,19deed74-31fa-11e8-83f4-f23c91bb6188,Worse customer service ever.,1,Canceled a month-to-month contract 3 months ag...,...,False,,2014-03-26,1,[],-0.8566,0.076,0.898,0.027,negative
7,4107004,253f4390-36f0-11eb-9146-d10abea54616,2022-09-26 12:22:20,Nurhaan C,Nurhaan C,,253f4390-36f0-11eb-9146-d10abea54616,The worst service provider,1,So I’ve got a contract with Telkom which is R4...,...,False,,2020-12-05,6,[],-0.8684,0.090,0.910,0.000,negative
8,4106907,223e5054-31fa-11e8-83f4-f23c91bb6188,2022-09-26 11:35:43,Rehana,Rehana,,223e5054-31fa-11e8-83f4-f23c91bb6188,HELL with TELKOM: HELLKOM should be their new ...,1,I am so frustrated at the moment that NO ONE a...,...,False,,2015-05-04,15,[],-0.9180,0.096,0.846,0.058,negative
9,4106748,dd819d30-3d74-11ed-a337-45091a2889d9,2022-09-26 10:35:59,DAVID A,DAVID A,,dd819d30-3d74-11ed-a337-45091a2889d9,Fraudulent debit from Telkom.,1,Never take any contract with telkom hatfield. ...,...,False,,2022-09-26,1,[],-0.5507,0.091,0.828,0.082,negative


In [14]:
# for each row in dataframe assign points
for index,row in df.iterrows():
    # if positive and review_rating is 5 then assign 100 points and author_total_reviews_count > 5 
    if row['sentiment'] == 'positive' and row['review_rating'] == 5 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = 100
    # if positive and review_rating is 5 then assign 80 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'positive' and row['review_rating'] == 5 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = 80
    # if positive and review_rating is 4 then assign 60 points and author_total_reviews_count > 5
    elif row['sentiment'] == 'positive' and row['review_rating'] == 4 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = 60
    # if positive and review_rating is 4 then assign 40 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'positive' and row['review_rating'] == 4 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = 40
    # if positive and review_rating is 3 then assign 20 points and author_total_reviews_count > 5
    elif row['sentiment'] == 'positive' and row['review_rating'] == 3 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = 20
    # if positive and review_rating is 3 then assign 10 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'positive' and row['review_rating'] == 3 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = 10
    # if negative and review_rating is 1 then assign -100 points and author_total_reviews_count > 5
    elif row['sentiment'] == 'negative' and row['review_rating'] == 1 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = -100
    # if negative and review_rating is 1 then assign -80 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'negative' and row['review_rating'] == 1 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = -80
    # if negative and review_rating is 2 then assign -60 points and author_total_reviews_count > 5  
    elif row['sentiment'] == 'negative' and row['review_rating'] == 2 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = -60
    # if negative and review_rating is 2 then assign -40 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'negative' and row['review_rating'] == 2 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = -40
    # if negative and review_rating is 3 then assign -20 points and author_total_reviews_count > 5
    elif row['sentiment'] == 'negative' and row['review_rating'] == 3 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = -20
    # if negative and review_rating is 3 then assign -10 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'negative' and row['review_rating'] == 3 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = -10
    # if neutral and review_rating is 3 then assign 0 points and author_total_reviews_count > 5
    elif row['sentiment'] == 'neutral' and row['review_rating'] == 3 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = 0
    # if neutral and review_rating is 3 then assign 0 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'neutral' and row['review_rating'] == 3 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = 0
    # if neutral and review_rating is 4 then assign 0 points and author_total_reviews_count > 5
    elif row['sentiment'] == 'neutral' and row['review_rating'] == 4 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = 0
    # if neutral and review_rating is 4 then assign 0 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'neutral' and row['review_rating'] == 4 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = 0
    # if neutral and review_rating is 5 then assign 0 points and author_total_reviews_count > 5
    elif row['sentiment'] == 'neutral' and row['review_rating'] == 5 and row['author_total_reviews_count'] >= 5:
        df.loc[index,'points'] = 0
    # if neutral and review_rating is 5 then assign 0 points and author_total_reviews_count < 5
    elif row['sentiment'] == 'neutral' and row['review_rating'] == 5 and row['author_total_reviews_count'] < 5:
        df.loc[index,'points'] = 0
    
# get maximum points
maxpoints = len(df) * 100
# score = total points / maximum points
score = (df['points'].sum() / maxpoints) * 100
print(score)
# work out average review rating
print(df['review_rating'].mean())

-89.0909090909091
1.2054545454545456


In [27]:
def FetchStats(slugname):
    url = 'https://api.hellopeter.com/consumer/business/'+slugname
    page = requests.get(url)
    jsondata=json.loads(page.text)
    return jsondata

data = FetchStats('game')

# get the industryLogo
industryLogo = data['industryLogo']
data
# get the trustIndex
trustIndex = data['trustIndex']
# get the industryName

keywords = data['businessDetails']['keywords']
#remove any keywords that contains hello or peter
keywords = [x for x in keywords if 'hello' not in x]
keywords = [x for x in keywords if 'peter' not in x]
keywords

# get positive count from df
positive_count = df[df['sentiment'] == 'positive'].count()['sentiment']
# get negative count from df
negative_count = df[df['sentiment'] == 'negative'].count()['sentiment']
# get neutral count from df
neutral_count = df[df['sentiment'] == 'neutral'].count()['sentiment']
 
data

businessDetails = data['businessDetails']

{'trustIndex': 1.8,
 'trustIndexNoRounding': '1.7730288',
 'industryRanking': 0,
 'industryName': 'Retail',
 'industryShortName': None,
 'industrySlug': 'retail',
 'industryLogo': '/static/img/industries/icons/retail-icon.jpg',
 'businessName': 'Game',
 'reviewCountTotal': 1832,
 'nps': {'promoters': 338,
  'passives': 258,
  'detractors': 2960,
  'totalRespondents': 3556,
  'nps_total': -74},
 'businessDetails': {'name': 'Game',
  'slug': 'game',
  'description': None,
  'logo': None,
  'responding': False,
  'isPaid': False,
  'website': 'http://comp_edit_thanks.asp',
  'facebook': None,
  'twitter': None,
  'linkedin': None,
  'google': None,
  'instagram': None,
  'address': '9th Floor, North Tower, Liberty Towers 214 Samora Machel Street, KZN, Durban',
  'email': 'service@game.co.za',
  'tel': '086142633273',
  'averageResponseTime': '31.773945',
  'keywords': ['game cellphone contract application',
   'game services',
   'games and gizmos',
   'which stores accept game card',
   