<div>
<img src="images/icon_important.jpg" width="50" align="left"/>
</div>
<br>
<br>

### __Important Legal Notice__
By running and editing this Jupyter notebook with the corresponding dataset, you agree that you will not use or store the data for purposes other than participating in the Champagne Coding with DNB & Women in Data Science, Oslo. You will delete the data and notebook after the event and will not attempt to identify any of the commenters.

## Translating into English and cleaning up the data

Most libraries for sentiment analysis only support English. The exception is ```polyglot```, which is rather difficult to install. For that reason, we wrote a short script that translates the Norwegian reviews so that we end up with a consistent set of review comments in English.

__Note__ that if you get an ```HTTP Error``` due to ```Too many requests```, you need a VPN client to change your IP address if you want to keep running the translation functions.

In [None]:
import pandas as pd
from pathlib import Path
current_directory = Path.cwd()
reviews_directory = Path(current_directory, 'reviews')

Read the file containing DNB reviews.

In [None]:
df = pd.read_csv(Path(reviews_directory, 'dnb_reviews.csv'))

Let's clean the data frame:
- remove duplicates
- delete rows with missing values
- drop the leftover index column ```Unnamed: 0```
- remove the string "...Full Review" from the ```Review_Text``` column

In [None]:
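# Drop the leftover index column first: if its values are unique per row,
# keeping it would stop drop_duplicates from finding any duplicate reviews
df = df.drop(['Unnamed: 0'], axis=1)
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
# Remove the "...Full Review" link text left at the end of truncated reviews
df['Review_Text'] = df['Review_Text'].map(lambda text: text.replace("...Full Review", ""))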

#### Language Detection
Before we can translate, we need to know what language the reviews are in.

In [None]:
from textblob import TextBlob
from nltk.tokenize import sent_tokenize # for tokenizing into sentences
import statistics

In [None]:
from langdetect import detect
print(detect("Har ikke root tilgang, kommer fortsatt ikke"))
print(detect("The best way to accses dnb"))

A helper function to find the language of a given text string. Anything that is not detected as English is treated as Norwegian.

In [None]:
def detect_lang(text):
    # Call detect() only once per text; everything that is not English counts as Norwegian
    if detect(text) == 'en':
        return 'en'
    return 'no'

In [None]:
df['Language'] = df['Review_Text'].apply(detect_lang)

df['Language'].unique()

Let's explore the results. Do they make sense?

In [None]:
df[df['Language']=='en']['Review_Text'][:10]

In [None]:
df[df['Language']=='no']['Review_Text'][:10]

How many of the records are English or Norwegian?

In [None]:
df[df['Language']=='en'].shape

In [None]:
df[df['Language']=='no'].shape

#### Translation
Now that we know which language each review is in, we can use ```googletrans```, an unofficial Python client for Google Translate, to translate from Norwegian to English.

In [None]:
from googletrans import Translator
translator = Translator()
try:
    print(translator.translate('Jeg har ikke penger', src='no').text)
except Exception as error:
    print('Translation failed:', error)

Translate the Norwegian reviews to English by applying our function.

__Note__ that if you get an ```HTTP Error``` due to ```Too many requests```, you need a VPN client to change your IP address if you want to keep running the translation functions.
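
Before switching IP address, it may be worth retrying a failed request after a pause, since rate limits are often temporary. The sketch below shows a generic retry helper with exponentially growing delays. ```call_with_backoff``` and its parameters are hypothetical names introduced here, and there is no guarantee that waiting is enough to lift the limit.

```python
import time


def call_with_backoff(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), pausing longer after each failed attempt."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)


# Possible use in this notebook (assumes the `translator` object defined above):
# call_with_backoff(translator.translate, 'Jeg har ikke penger', src='no').text
```

The ```sleep``` parameter exists only so that the pauses can be replaced in tests; in normal use the default ```time.sleep``` applies.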

In [None]:
import sys

def translate_to_eng(text):
    try:
        return translator.translate(text, src='no').text
    except Exception:
        # Retry without emojis; note that this also strips other non-ASCII characters such as æ, ø and å
        emoji_stripped_text = text.encode('ascii', 'ignore').decode('ascii')
        try:
            return translator.translate(emoji_stripped_text, src='no').text
        except Exception:
            print('Error, could not translate:\n', text)
            # Stop here so it doesn't keep sending requests that will fail
            sys.exit('Exited with error.')
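
The ASCII fallback in ```translate_to_eng``` removes emojis, but it also drops every other non-ASCII character, including the Norwegian letters æ, ø and å, which can garble the text sent to the translator. A sketch of a gentler alternative follows, assuming it is enough to remove characters in the Unicode category "Symbol, other" (```So```), which covers most emoji pictographs. ```strip_symbols``` is a hypothetical helper, not part of the original notebook.

```python
import unicodedata


def strip_symbols(text):
    """Remove characters in Unicode category 'So' (most emojis) and keep letters."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'So')


sample = 'Får ikke logget inn \U0001F621'
print(strip_symbols(sample))                             # Norwegian letters kept
print(sample.encode('ascii', 'ignore').decode('ascii'))  # ASCII fallback for comparison
```

Some emoji sequences also contain modifiers or joiners that belong to other categories, so this is a starting point rather than a complete emoji filter.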

In [None]:
df['Review_Eng'] = df[df['Language']=='no']['Review_Text'].apply(translate_to_eng)

Fill out the column ```Review_Eng``` with the reviews that were originally in English. This way we have a complete column of reviews exclusively in English.

In [None]:
df.loc[df['Language'] == 'en', 'Review_Eng'] = df['Review_Text']

df[df['Language'] == 'no'][["Review_Eng", "Review_Text", "Language"]].sample(10)

Some reviews are still left untranslated:

In [None]:
#df.query('(Review_Text == Review_Eng) and Language == "no"')[['Review_Text', 'Review_Eng']].shape
df[(pd.isnull(df.Review_Eng)) & (df.Language == 'no')]

#### TextBlob

Let's try another library, ```TextBlob```, on the reviews that are still untranslated.

In [None]:
def translate_to_eng_textblob(text):
    text_blob = TextBlob(text)
    try:
        if text_blob.detect_language() != 'en':
            text_blob = text_blob.translate(to='en')
    except Exception:
        # On any failure (e.g. too many requests), keep the original text
        pass
    return str(text_blob)

This one works for a small number of requests; beyond that it fails with "too many requests".

In [None]:
# Rows that are marked as Norwegian but still have no English translation
untranslated = pd.isnull(df.Review_Eng) & (df.Language == 'no')
df.loc[untranslated, 'textblob_Translate'] = df.loc[untranslated, 'Review_Text'].apply(translate_to_eng_textblob)

Copy the TextBlob results into ```Review_Eng``` so we can see how they look.

In [None]:
df.loc[df['textblob_Translate'].notnull(), 'Review_Eng'] = df['textblob_Translate']

Looking at the rows where the translation is identical to the original text, most of them are English sentences that were misclassified as Norwegian.

In [None]:
df.query('(Review_Text == Review_Eng) and Language == "no"')[['Review_Text', 'Review_Eng', 'Language']]

Let's classify those as English.

In [None]:
df.loc[df.query('(Review_Text == Review_Eng) and Language == "no"').index, 'Language'] = 'en'

#### Review the results

In [None]:
df[df['Language'] == 'en'][['Review_Text', 'Review_Eng']].sample(10)

In [None]:
df[df['Language'] == 'no'][['Review_Text', 'Review_Eng']].sample(10)

### Anonymizing the names

In [None]:
from faker import Faker
fake = Faker('no_NO')

def anonymous_name(text):
    # The original name is ignored; every row gets a random fake Norwegian name
    return fake.name()

try:
    df['Name'] = df['Name'].apply(anonymous_name)
except KeyError:
    print("This dataframe doesn't contain any names to be anonymized.")
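
Because ```anonymous_name``` ignores its input, a reviewer who appears on several rows gets a different fake name on each one. If it matters that reviews by the same person stay linked, one option is to cache the replacement per original name. The sketch below works under that assumption. ```make_anonymizer``` and ```name_factory``` are hypothetical names, and in this notebook ```fake.name``` would be passed as the factory.

```python
def make_anonymizer(name_factory):
    """Return a function that maps each original name to one stable fake name."""
    replacements = {}

    def anonymize(original_name):
        # Generate a fake name only the first time an original name is seen
        if original_name not in replacements:
            replacements[original_name] = name_factory()
        return replacements[original_name]

    return anonymize


# Stand-in factory so the sketch runs without Faker installed
counter = iter(range(1, 1000))
anonymize = make_anonymizer(lambda: f'Person {next(counter)}')

print(anonymize('Kari'), anonymize('Ola'), anonymize('Kari'))
```

The cache links real names to fake ones, so it should stay in memory only and never be written to disk alongside the data.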

In [None]:
df[df['Language'] == 'no'].sample(10)

In [None]:
df[df['Language'] == 'en'].sample(10)

In [None]:
df = df.drop(['textblob_Translate'], axis=1)

df.to_csv(Path(reviews_directory,'dnb_reviews_final.csv'))