<div>
<img src="images/icon_important.jpg" width="50" align="left"/>
</div>
<br>
<br>

### __Important Legal Notice__
By running and editing this Jupyter notebook with the corresponding dataset, you agree that you will not use or store the data for purposes other than participating in the Champagne Coding with DNB & Women in Data Science, Oslo. You will delete the data and notebook after the event and will not attempt to identify any of the commenters.

## Translating into English and cleaning up the data

Most sentiment analysis libraries support only English. ```polyglot``` is an exception, but it is difficult to install. We therefore wrote a short script that translates the reviews, so that all review comments end up in English.

__Note__: if you get an ```HTTP Error``` caused by ```Too many requests```, you need to change your IP address, for example with a VPN client, before you can continue running the translation functions.

In [1]:
import pandas as pd
from pathlib import Path
current_directory = Path.cwd()
reviews_directory = Path(current_directory, 'reviews')

Read the file containing DNB reviews.

In [2]:
df = pd.read_csv(Path(reviews_directory, 'dnb_reviews.csv'))

Let's clean the data frame:
- remove duplicates
- delete rows with missing values
- drop the leftover index column ```Unnamed: 0```
- remove the string "...Full Review" from the ```Review_Text``` column

In [3]:
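# 'Unnamed: 0' is still present at this point, so two rows only count as
# duplicates if that column matches as well.
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)  # delete rows with missing values
df = df.drop(['Unnamed: 0'], axis=1)
# Remove the truncation marker (this match is case-sensitive).
df['Review_Text'] = df['Review_Text'].map(lambda text: text.replace("...Full Review", ""))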
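
The ```replace``` call above is case-sensitive, so a lowercase marker such as ```...full review``` (one shows up in the translation output later in this notebook) is left in place. A minimal sketch of a case-insensitive alternative follows; the helper name ```strip_full_review_marker``` is hypothetical and not part of the original notebook.

```python
import re

# Matches the truncation marker in any letter case, e.g. "...Full Review".
FULL_REVIEW_MARKER = re.compile(r"\.\.\.full review", flags=re.IGNORECASE)


def strip_full_review_marker(text):
    """Remove every '...Full Review' marker, ignoring case."""
    return FULL_REVIEW_MARKER.sub("", text)


print(strip_full_review_marker("de j...full review"))  # de j
print(strip_full_review_marker("works fine"))          # works fine
```

In the notebook this could be applied with ```df['Review_Text'].map(strip_full_review_marker)```.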

#### Language Detection
Before we can translate, we need to know what language the reviews are in.

In [4]:
from textblob import TextBlob
from nltk.tokenize import sent_tokenize # for tokenizing into sentences
import statistics

In [5]:
from langdetect import detect
print(detect("Har ikke root tilgang, kommer fortsatt ikke"))
print(detect("The best way to accses dnb"))

no
en


A helper function to find the language of a given text string. Anything that is not detected as English is treated as Norwegian.

In [6]:
def detect_lang(text):
    # Call detect() once: every result other than English,
    # including other detected languages, is labelled Norwegian.
    if detect(text) == 'en':
        return 'en'
    return 'no'
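
The helper above assumes that detection always succeeds. ```langdetect``` can raise an exception for text without usable features, such as an empty string, which would stop the whole ```apply``` call. The sketch below shows one way to guard against that. The name ```detect_lang_safe``` and its ```default``` parameter are hypothetical, and the detector is passed in as an argument so the example runs without ```langdetect``` installed.

```python
def detect_lang_safe(text, detector, default="no"):
    """Return 'en' or `default` without ever raising.

    `detector` is any callable that maps a string to a language code,
    for example `langdetect.detect`.
    """
    try:
        code = detector(text)
    except Exception:
        # Detection failed, e.g. there were no usable features in the text.
        return default
    return "en" if code == "en" else default


# Stand-in detector used only to make this sketch runnable.
def fake_detector(text):
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if "the" in text.split() else "no"


print(detect_lang_safe("the best way to access dnb", fake_detector))  # en
print(detect_lang_safe("", fake_detector))                            # no
```

In the notebook the call would look like ```df['Review_Text'].apply(lambda text: detect_lang_safe(text, detect))```. The ```langdetect``` README also notes that results can vary between runs unless ```DetectorFactory.seed``` is set before the first detection.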

In [7]:
df['Language'] = df['Review_Text'].apply(detect_lang)

df['Language'].unique()

array(['en', 'no'], dtype=object)

Let's explore the results. Do they make sense?

In [8]:
df[df['Language']=='en']['Review_Text'][:10]

0    complete trash. used to be ok. its hardly an a...
1    "sim tool launches before the page is done loa...
2    worst app ever? slower and affords less privac...
3    i have been using the old and new version of t...
4    har ikke root tilgang, kommer fortsatt ikke in...
5    this is the best app to know how much you have...
6    almost never works. often try to log in but ge...
7    after the recent update, i can now log into th...
8    i like the fact that it is easy to access my a...
9    very unreliable, but when it works its pretty ...
Name: Review_Text, dtype: object

In [9]:
df[df['Language']=='no']['Review_Text'][:10]

31                              new version is very good
33                                   worst bank app ever
168                           you're missing night mode.
197                  can't log in after update error1012
200                                          didn't work
204                             horrible after update...
207                                         doesn't work
213                                  it does not work 🙄🙄
217    edit: fungerer etter å reinstallere. fungerer ...
218    appen er helt grei. den viser saldo og siste t...
Name: Review_Text, dtype: object

How many of the records are English or Norwegian?

In [10]:
df[df['Language']=='en'].shape

(307, 4)

In [11]:
df[df['Language']=='no'].shape

(587, 4)

#### Translation
Now that we know which language each review is in, we can translate the Norwegian reviews to English with ```googletrans```, an unofficial Python client for Google Translate.

In [12]:
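from googletrans import Translator
translator = Translator()
try:
    # The destination language defaults to English.
    print(translator.translate('Jeg har ikke penger', src='no').text)
except Exception as error:  # e.g. rate limiting or a network failure
    print('Translation failed:', error)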

I do not have money


Translate the Norwegian reviews to English by applying the helper function defined below.

__Note__: as mentioned at the top, an ```HTTP Error``` caused by ```Too many requests``` means you need to change your IP address, for example with a VPN client, before you can continue translating.

In [13]:
def translate_to_eng(text):
    try:
        return translator.translate(text, src='no').text
    except Exception:
        # Retry with all non-ASCII characters removed. This strips emojis,
        # but also the Norwegian letters æ, ø and å.
        emoji_stripped_text = text.encode('ascii', 'ignore').decode('ascii')
        try:
            return translator.translate(emoji_stripped_text, src='no').text
        except Exception:
            # Give up and keep the original text.
            print('Error, returning same string:\n', text)
            return text

In [14]:
df['Review_Eng'] = df[df['Language']=='no']['Review_Text'].apply(translate_to_eng)

Error, returning same string:
 kommer ikke inn (error 1015)
Error, returning same string:
 funksjonene jeg har brukt i denne appen, har utelukkende vært for å sjekke saldo, sjekke valuta kurs, og bruke vipps (de hadde en snarvei inne i appen tidligere). jeg nå nå logge inn for å sjekke saldo, ikke at det egentlig gjør meg noe som helst, men valuta kurs og vipps snarveien er nå borte, de j...full review
Error, returning same string:
 den nye versjonen er flere steg tilbake fra den forrige. færre funksjoner og mer klønete grensesnitt.
Error, returning same string:
 feilmelding 1010 ved innlogging. ingen info om denne meldingen på nettsiden. avinstallert og installert igjen. ingen endring. tømt app data, ingen endring.
Error, returning same string:
 og plutselig gjorde dere appen nytteløs for meg. om jeg velger å roote telefonen min, så er det mitt valg og risiko. nå som jeg ikke lengre kan benytte appen på telefonen min så kan ikke jeg ha dnb som hovedbank lengre. edit: det går helt fint
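
The output does not say why these requests failed. If some of the failures are temporary, for instance brief rate limiting, retrying after a pause may recover them. Below is a minimal sketch of a retry wrapper with exponential backoff. The name ```with_retries``` and its parameters are hypothetical, and a stand-in function replaces the real translation call so the example runs offline.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `func` so that failed calls are retried with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last error if every attempt fails.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise
                sleep(base_delay * 2 ** attempt)
    return wrapper


# Stand-in for a translation call that fails twice before succeeding.
calls = {"count": 0}


def flaky_translate(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Too many requests")
    return text.upper()


delays = []  # record the waits instead of actually sleeping
translate = with_retries(flaky_translate, sleep=delays.append)
print(translate("hei"))  # HEI
print(delays)            # [1.0, 2.0]
```

In the notebook the wrapped function would be the raw request, for example ```with_retries(lambda text: translator.translate(text, src='no').text)```. Backoff only helps with short-lived errors; it does not get around a longer block of your IP address.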

Fill in the ```Review_Eng``` column with the reviews that were originally written in English. This gives us one complete column of English review text.

In [15]:
df.loc[df['Language'] == 'en', 'Review_Eng'] = df['Review_Text']

df[df['Language'] == 'no'][["Review_Eng", "Review_Text", "Language"]].sample(10)

Unnamed: 0,Review_Eng,Review_Text,Language
459,must log in with BankID for mobile phones ever...,"må logge inn med bankid på mobil hver gang, da...",no
462,"does not work, stop the logging. has rebooted ...","virker ikke, stopper opp på innloggingen. har ...",no
221,come on dnb .. this app has been on the market...,kom igjen da dnb.. denne appen har vært på mar...,no
336,"I tend not to write such feedback, but here I ...",jeg pleier ikke å skrive slike tilbakemeldinge...,no
316,cant even log in ..,cant even log in..,no
782,"""There has been an error."" 'Nuff said. dnb, sh...","""det har oppstått en feil."" 'nuff said. dnb, s...",no
359,very poor app! can not log in even though I us...,svært dårlig app! får ikke logge meg inn selvo...,no
389,"do not update, stick to the old app, this is s...","ikke oppdater, hold deg til den gamle appen,de...",no
608,velkommen til 90tallet....,velkommen til 90tallet....,no
871,can not log in to see the balance,kan ikke logge inn for å se saldo,no


Some Norwegian reviews are still untranslated:

In [16]:
df.query('(Review_Text == Review_Eng) and Language == "no"')[['Review_Text', 'Review_Eng']].shape

(61, 2)

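#### TextBlob

Let's try another library, TextBlob, on the reviews that are still untranslated. Note that ```detect_language()``` and ```translate()``` were removed in TextBlob 0.18, so this cell needs an older release.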

In [17]:
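def translate_to_eng_textblob(text):
    text_blob = TextBlob(text)
    try:
        # Both calls send a request and can fail, for example when the
        # service is rate limiting or the translation equals the input.
        if text_blob.detect_language() != 'en':
            text_blob = text_blob.translate(to='en')
    except Exception:
        pass  # keep the original text
    return str(text_blob)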

This works for a small number of requests. With more, the service responds with "too many requests".
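
One way to send fewer requests is to translate each distinct review only once and reuse the result for repeats (short reviews such as "ok app" occur more than once in this dataset). The sketch below illustrates the idea. ```translate_unique``` is a hypothetical helper, and the translation function is passed in so the example runs without network access.

```python
def translate_unique(texts, translate):
    """Translate each distinct text once and return results in input order."""
    cache = {}
    results = []
    for text in texts:
        if text not in cache:
            cache[text] = translate(text)
        results.append(cache[text])
    return results


# Stand-in translator that records how often it is called.
requests = []


def fake_translate(text):
    requests.append(text)
    return text.title()


print(translate_unique(["ok app", "wow", "ok app"], fake_translate))
# ['Ok App', 'Wow', 'Ok App']
print(len(requests))  # 2
```

The saving depends on how many reviews are exact repeats, so this reduces the request count rather than removing the limit.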

In [18]:
df['textblob_Translate'] = df.query('(Review_Text == Review_Eng) and Language == "no"')['Review_Text'].apply(translate_to_eng_textblob)

In [19]:
df.loc[df['textblob_Translate'].notnull(), 'Review_Eng'] = df['textblob_Translate']

How do the results look? We can see some that were incorrectly classified as Norwegian.

In [20]:
df.query('(Review_Text == Review_Eng) and Language == "no"')[['Review_Text', 'Review_Eng', 'Language']]

Unnamed: 0,Review_Text,Review_Eng,Language
31,new version is very good,new version is very good,no
168,you're missing night mode.,you're missing night mode.,no
229,ok app,ok app,no
248,all good!,all good!,no
249,excellent,excellent,no
284,ok app,ok app,no
320,wow,wow,no
321,bah,bah,no
334,it's good,it's good,no
335,gooooood,gooooood,no


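Let's classify those as English. Note that this relabels every review whose text is still unchanged, so a Norwegian review that failed to translate is relabelled as English too.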

In [21]:
df.loc[df.query('(Review_Text == Review_Eng) and Language == "no"').index, 'Language'] = 'en'
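
Because the rule above relies only on the text being unchanged, it cannot tell a misclassified English review from a Norwegian one that failed to translate (for example "velkommen til 90tallet...." in the earlier sample). A rough check like the one sketched below could be run before relabelling. The word list and the name ```looks_norwegian``` are illustrative assumptions, not part of the original notebook.

```python
import re

# Small, hand-picked set of common Norwegian words (illustrative only).
NORWEGIAN_HINTS = {"og", "ikke", "jeg", "det", "til", "har", "som", "appen", "etter"}


def looks_norwegian(text):
    """Return True if the text has Norwegian letters or a hint word."""
    lowered = text.lower()
    if any(letter in lowered for letter in "æøå"):
        return True
    words = re.findall(r"[a-z]+", lowered)
    return any(word in NORWEGIAN_HINTS for word in words)


print(looks_norwegian("velkommen til 90tallet...."))  # True
print(looks_norwegian("new version is very good"))    # False
print(looks_norwegian("ok app"))                      # False
```

Rows for which ```looks_norwegian``` returns True could then keep their ```no``` label and be reviewed by hand.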

#### Review the results

In [22]:
df[df['Language'] == 'en'][['Review_Text', 'Review_Eng']].sample(10)

Unnamed: 0,Review_Text,Review_Eng
717,"the app is not practical at all, learn somethi...","the app is not practical at all, learn somethi..."
10,the few last updates are horrible. i have to l...,the few last updates are horrible. i have to l...
199,good to use this app,good to use this app
153,basically just a fancy shortcut to the web bro...,basically just a fancy shortcut to the web bro...
742,"does nothing, just redirecting to website, whi...","does nothing, just redirecting to website, whi..."
498,nothing works....,nothing works....
7,"after the recent update, i can now log into th...","after the recent update, i can now log into th..."
142,i can no longer login to the app or to the mob...,i can no longer login to the app or to the mob...
187,not anymore possible to login by fingerprint,not anymore possible to login by fingerprint
650,"never works and i am always frustrated....dnb,...","never works and i am always frustrated....dnb,..."


In [23]:
df[df['Language'] == 'no'][['Review_Text', 'Review_Eng']].sample(10)

Unnamed: 0,Review_Text,Review_Eng
369,den nye oppdateringen er helt forferdelig. den...,the new update is absolutely terrible. it has ...
417,"etter ny oppdatering, så er appen næremere en ...","after new update, then the app close more a ba..."
874,ble fra bra til svart dårlig,was from good to poor black
545,virker veldig bra,work very well
261,har nå brukt denne appen en god stund. biometr...,have now used this app for a while. biometric ...
392,appen var ok før. nå er den helt ubrukelig. in...,app was ok before. Now it is completely useles...
296,appen er veldig utdatert og stygg. ser ut som ...,app is really outdated and ugly. looks like it...
455,den gamle virket. det gjør ikke denne. får ikk...,the old wood. it does not do this. Can not log...
273,"ubrukelig. får ikke til å initialisere appen, ...","useless. am unable to initialize the app, even..."
252,"fin. nyttig. liker i prinsippet oppdateringen,...","fine. useful. like in principle patch, had får..."


### Anonymizing the names

In [24]:
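from faker import Faker
fake = Faker('no_NO')

def anonymous_name(text):
    # Ignore the real name and return a random Norwegian one.
    return fake.name()

# Check for the column explicitly instead of catching every exception.
if 'Name' in df.columns:
    df['Name'] = df['Name'].apply(anonymous_name)
else:
    print("This dataframe doesn't contain any names to be anonymized.")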

This dataframe doesn't contain any names to be anonymized.
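
This dataset has no name column, so nothing is replaced here. For a dataset that does have one, note that ```anonymous_name``` draws a fresh random name on every call, so the same reviewer would get a different name in each row. If consistent pseudonyms matter, a mapping can be kept, as in this sketch. ```make_pseudonymizer``` is a hypothetical helper, and the name generator is passed in (for example ```fake.name```) so the example runs without Faker installed.

```python
import itertools


def make_pseudonymizer(generate_name):
    """Return a function that maps each real name to one stable fake name."""
    assigned = {}

    def pseudonymize(real_name):
        if real_name not in assigned:
            assigned[real_name] = generate_name()
        return assigned[real_name]

    return pseudonymize


# Stand-in generator used only to make this sketch runnable.
counter = itertools.count(1)
pseudonymize = make_pseudonymizer(lambda: f"Reviewer {next(counter)}")

print(pseudonymize("Kari"))  # Reviewer 1
print(pseudonymize("Ola"))   # Reviewer 2
print(pseudonymize("Kari"))  # Reviewer 1
```

The ```assigned``` dictionary links real names to pseudonyms, so it should be discarded once the column has been replaced.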


In [25]:
df[df['Language'] == 'no'].sample(10)

Unnamed: 0,Date,Review_Score,Review_Text,Language,Review_Eng,textblob_Translate
309,"March 14, 2019",3,etter oppdatering av android fikk jeg beskjed ...,no,after updating the android I was told to reins...,
448,"March 14, 2019",1,innlogging med finger avtrykk funker ikke.. må...,no,login with fingerprint does not work .. must u...,
297,"March 25, 2019",1,ubrukelig dritt app etter nye oppdatering må j...,no,useless crap app for new updates I have to log...,
381,"February 24, 2019",1,etter stor-oppdateringen har jeg ikke kunnet l...,no,after major update I have not been able to log...,
580,"March 20, 2019",1,dårlig. gi oss tilbake den gamle appen.,no,bad. give us back the old app.,bad. give us back the old app.
682,"November 2, 2017",3,den er ok.,no,It is OK.,
238,"July 19, 2019",1,"det virker ikke å overføre mellom kontoer, som...",no,"it does not transfer between accounts, which i...",
452,"January 26, 2019",1,"nye oppdateringen er skikkelig dårlig, den for...",no,"new update is really bad, the last app was muc...",
318,"March 16, 2019",1,elendig fingeravtrykk login,no,lousy fingerprint login,
453,"February 12, 2019",5,etter jeg sletta og installerte appen igjen ku...,no,after i deleted and installed the app I could ...,


In [26]:
df[df['Language'] == 'en'].sample(10)

Unnamed: 0,Date,Review_Score,Review_Text,Language,Review_Eng,textblob_Translate
792,"April 25, 2013",4,does all the important things,en,does all the important things,
117,"January 17, 2019",1,"it was so good before, that i could check my a...",en,"it was so good before, that i could check my a...",
60,"March 10, 2019",1,this app is just an appified web page that req...,en,this app is just an appified web page that req...,
132,"May 31, 2019",1,crashes 9 of 10 times,en,crashes 9 of 10 times,
50,"March 30, 2019",1,app asked me for uninstalling and installing o...,en,app asked me for uninstalling and installing o...,
112,"March 15, 2019",1,out of all the 500 chatacters im allowed to wr...,en,out of all the 500 chatacters im allowed to wr...,
796,"April 9, 2013",5,i can now see my balance on all account withou...,en,i can now see my balance on all account withou...,
35,"May 16, 2019",4,"works so far, but how do i change it to show i...",en,"works so far, but how do i change it to show i...",
622,"October 19, 2016",2,before ot was a reallu good app but its imposs...,en,before ot was a reallu good app but its imposs...,
91,"January 31, 2019",1,claims my brand new phone is rooted and blocks...,en,claims my brand new phone is rooted and blocks...,


In [27]:
df = df.drop(['textblob_Translate'], axis=1)

df.to_csv(Path(reviews_directory,'dnb_reviews_final.csv'))