# PostNord Trustpilot Reviews

## Structure of the code

1. Combine header and text column
2. Anonymisation
    - Remove names
    - Remove dates
    - Create unique ids for each person
    - Use review and url to remove repeated comments
    - Remove urls
3. Cleaning and processing
    - If a comment was only punctuation => remove that row
4. Balance data (around 17,000 per category)
    - Randomise data (in categories)
    - Keep the first 17,000 of each category

## Initial code

In [169]:
#!pip install -U spacy
#!python -m spacy download da_core_news_md

In [170]:
# import packages
import pandas as pd
import os
import re
import spacy
import da_core_news_md

In [171]:
nlp = da_core_news_md.load()

In [172]:
# define path
path = os.path.join("..", "in", "postnord_trustpilot_reviews.csv")

In [173]:
# read csv
df = pd.read_csv(path)
# fill missing values with a single space
df.fillna(" ", inplace = True)
# rename columns
df.columns = ['order', 'name', 'date', 'rating', 'text', 'profile_link', 'review_count', 'header']

In [174]:
# make a deep copy of the first 1000 rows of the data
sm = df[:1000].copy(deep = True)

## Combination of header and text column

If the header is identical to the text, or if the header matches the corresponding window at the start of the text:

```df['review'] = df['text']```

Otherwise: `df['review'] = df['header'] + " " + df['text']`

_Can also be reversed_

In [175]:
# remove the ellipsis character (…) from the header
sm['header'] = sm['header'].str.replace('…', '', regex = False)

In [176]:
# create a list to hold the combined reviews
review = []

# loop over the dataframe
for index, row in sm.iterrows():
    # txt is the text-column
    txt = row["text"]
    # head is the header-column
    head = row["header"]
    # search for the header text at the start of the text-column
    x = re.search(f"^{re.escape(head)}", txt)
    # if the text-column starts with the header text
    if x:
        # append the text column to the list
        review.append(row['text'])
    # otherwise...
    else:
        # append the header column and the text column to the list with a white space in between
        review.append(row['header'] + " " + row['text'])

In [177]:
# create a new review-column from the list
sm['review'] = review
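
The rule above can also be written as a small helper function. This is a minimal sketch, not the notebook's own code: the function name `combine_header_and_text` and the example strings are made up for illustration. Unlike the loop above, it also strips whitespace from the header and treats an empty header as "nothing to prepend".

```python
def combine_header_and_text(header: str, text: str) -> str:
    """Return the text, prefixed by the header unless the text already starts with it."""
    # Drop the ellipsis character and surrounding whitespace from the header
    header = header.replace("…", "").strip()
    # Nothing to prepend if the header is empty or already opens the text
    if not header or text.startswith(header):
        return text
    return f"{header} {text}"


# Made-up example rows (header, text)
examples = [
    ("Hurtig levering…", "Hurtig levering. Pakken kom dagen efter."),
    ("Super service", "Pakken kom til tiden."),
]
combined = [combine_header_and_text(h, t) for h, t in examples]
```

With pandas, the same function could be applied row by row, for instance through a list comprehension over `zip(sm['header'], sm['text'])`.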

## Removal of duplicates

In [178]:
len(sm)

1000

In [179]:
sm = sm.drop_duplicates(subset=['profile_link', 'date', 'review'], keep='first')

In [180]:
len(sm)

998
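
To illustrate what `drop_duplicates` does with this `subset`, here is a small sketch on made-up data (the profile links, dates and reviews are invented). A row is only dropped when all three columns match an earlier row, so two different users posting the same text are both kept.

```python
import pandas as pd

# Made-up toy data: rows 0 and 1 are exact repeats, row 2 is a different user
toy = pd.DataFrame({
    "profile_link": ["/users/a", "/users/a", "/users/b"],
    "date": ["2021-01-01", "2021-01-01", "2021-01-01"],
    "review": ["Hurtig levering", "Hurtig levering", "Hurtig levering"],
})

# Keep the first occurrence of each (profile_link, date, review) combination
deduplicated = toy.drop_duplicates(subset=["profile_link", "date", "review"], keep="first")
n_removed = len(toy) - len(deduplicated)
```

Note that the original index labels are preserved, so the index can contain gaps after this step unless `reset_index(drop=True)` is called.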

## Anonymisation

In [181]:
sm = sm.drop(columns=['name', 'date', 'profile_link', 'header', 'text', 'order', 'review_count'])

In [182]:
sm.head()

index,rating,review
0,5,Hurtig levering.
1,5,Altid pakker til tiden
2,5,Som sædvanlig er min pakke leveret på bedste m...
3,5,"Hurtig behandling Alt fungerede, hurtig leveri..."
4,5,"Forbilledligt Pakkepost, når den er bedst!"
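
The outline lists "Create unique ids for each person", while the cell above simply drops the identifying columns. If per-person ids are wanted, one possible approach is to derive them from `profile_link` before that column is dropped. The sketch below uses made-up data and a hypothetical column name `person_id`; it is an assumption about how this step could look, not the notebook's method.

```python
import pandas as pd

# Made-up toy data: the first and third review come from the same profile
toy = pd.DataFrame({
    "profile_link": ["/users/a", "/users/b", "/users/a"],
    "review": ["Hurtig levering", "Pakken kom aldrig", "Altid til tiden"],
})

# pd.factorize assigns integer codes 0, 1, 2, ... in order of first appearance
toy["person_id"] = pd.factorize(toy["profile_link"])[0]

# Drop the identifying column once the id has been created
anonymised = toy.drop(columns=["profile_link"])
```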


## Cleaning and processing

In [183]:
clean_review = []

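for text in sm['review'].tolist():
    # remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]+', '', text)
    # collapse repeated spaces into one
    text = re.sub(' +', ' ', text)
    # trim leading/trailing whitespace and lowercase
    text = text.strip()
    text = text.lower()
    clean_review.append(text)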

In [184]:
sm['review'] = clean_review
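
The outline says that comments consisting only of punctuation should be removed, but no cell does this yet. A minimal sketch of that step on made-up data follows: after cleaning, such comments become empty strings and can be filtered out. The helper name `clean_text` is hypothetical and mirrors the cleaning loop above.

```python
import re

import pandas as pd


def clean_text(text: str) -> str:
    """Strip punctuation, collapse repeated spaces, trim and lowercase."""
    text = re.sub(r"[^\w\s]+", "", text)
    text = re.sub(r" +", " ", text)
    return text.strip().lower()


# Made-up toy data: the second review is punctuation only
toy = pd.DataFrame({
    "rating": [5, 1, 3],
    "review": ["Hurtig levering.", "!!!", "Helt OK :)"],
})
toy["review"] = toy["review"].apply(clean_text)

# Rows that were only punctuation are now empty strings; drop them
toy = toy[toy["review"] != ""]
```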

In [185]:
sm.head()

index,rating,review
0,5,hurtig levering
1,5,altid pakker til tiden
2,5,som sædvanlig er min pakke leveret på bedste m...
3,5,hurtig behandling alt fungerede hurtig leverin...
4,5,forbilledligt pakkepost når den er bedst


In [186]:
# tokenisation and lemmatisation
lemmas = []

for x in sm['review']:
    document = nlp(x)
    temp = []
    for token in document:
        temp.append(token.lemma_)
    lemmas.append(temp)
    
sm['lemmas']=lemmas

In [187]:
sm

index,rating,review,lemmas
0,5,hurtig levering,"[hurtig, levering]"
1,5,altid pakker til tiden,"[altid, pakke, til, tid]"
2,5,som sædvanlig er min pakke leveret på bedste m...,"[som, sædvanlig, være, min, pakke, levere, på,..."
3,5,hurtig behandling alt fungerede hurtig leverin...,"[hurtig, behandling, alt, fungere, hurtig, lev..."
4,5,forbilledligt pakkepost når den er bedst,"[forbilledligt, pakkepost, når, den, være, bedst]"
...,...,...,...
995,5,tak for super service,"[tak, for, super, service]"
996,5,hurtigt dejligt med smserne,"[hurtigt, dejligt, med, sms]"
997,5,hurtig leverig,"[hurtig, leverig]"
998,2,fik sms at min pakke ville blive leveret tirsd...,"[få, sms, at, min, pakke, ville, blive, levere..."


In [19]:
# removal of stopwords
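
The stopword cell is still empty. One way it could be filled in is to filter each lemma list against a set of stopwords. The sketch below uses a small hand-picked set purely for illustration; in practice spaCy's Danish stopword set (importable as `STOP_WORDS` from `spacy.lang.da.stop_words`) could be passed instead. The function name `remove_stopwords` is hypothetical.

```python
def remove_stopwords(lemmas, stopwords):
    """Return the lemmas that are not in the given stopword set."""
    return [lemma for lemma in lemmas if lemma not in stopwords]


# Hypothetical, hand-picked stopword set used only for illustration
example_stopwords = {"til", "at", "er", "den", "som", "på", "min"}

filtered = remove_stopwords(["altid", "pakke", "til", "tid"], example_stopwords)
```

Applied to the dataframe, this could look like `sm['lemmas'].apply(lambda l: remove_stopwords(l, example_stopwords))`.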

## Balancing the data

In [20]:
# the full dataset has 17122 3-star reviews; the counts below are for the 1000-row sample
sm['rating'].value_counts()

5    780
1     91
4     66
2     35
3     26
Name: rating, dtype: int64

In [21]:
def balance(dataframe, n=17000):
    """
    Create a balanced sample from imbalanced datasets.
    
    dataframe: 
        Pandas dataframe with a column called 'rating'
    n:         
        Number of samples from each rating, defaults to 17000
    """
    # Use pandas to select a random sample of n rows from each rating
    out = (dataframe.groupby('rating', as_index=False)
            .apply(lambda x: x.sample(n=n))
            .reset_index(drop=True))
    
    return out

In [2]:
#df_balanced = balance(df, 17000)

In [None]:
#df_balanced['rating'].value_counts()
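
One caveat with `balance` as written: `sample(n=n)` raises an error when a rating group has fewer than `n` rows, which would happen on the 1000-row sample. The sketch below is a possible variant, not the notebook's method: it caps `n` at the group size and fixes a seed so the result is reproducible. The name `balance_capped`, the `label_col` and `seed` parameters and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def balance_capped(dataframe, n=17000, label_col="rating", seed=42):
    """Sample up to n rows per label; smaller groups are kept whole."""
    parts = [
        # min() avoids asking for more rows than the group contains
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in dataframe.groupby(label_col)
    ]
    return pd.concat(parts).reset_index(drop=True)


# Made-up toy data: six 5-star, three 1-star and two 3-star reviews
toy = pd.DataFrame({
    "rating": [5] * 6 + [1] * 3 + [3] * 2,
    "review": [f"review {i}" for i in range(11)],
})

balanced = balance_capped(toy, n=2)
counts = balanced["rating"].value_counts().to_dict()
```

If `n` is larger than the smallest group, the output is no longer perfectly balanced, so `n` should be chosen at or below the size of the rarest rating.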