# Data Preprocessing 

## Steps:

1. Read in the reviews
2. Carry out the preprocessing steps
3. Output the simplified list



## 1. Reading in the Reviews


Import the required libraries.

In [2]:
import re, string, unicodedata
import nltk
import contractions
import inflect
from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

Load the reviews into a list of dicts (`df`).

In [3]:
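import ast
import gzip

def parse(path):
    """Yield one review dict per line of the gzip-compressed file."""
    # Text mode, because ast.literal_eval expects str rather than bytes.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            # literal_eval only accepts Python literals, unlike eval.
            yield ast.literal_eval(line)

df = list(parse("../Rumprobieren/data/data.json.gz"))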
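
If the file holds strict JSON with one object per line, the standard `json` module could do the parsing instead. The following is a minimal sketch under that assumption; `parse_json_lines` and the small sample file are hypothetical and only illustrate the idea, they are not part of the notebook.

```python
import gzip
import json
import os
import tempfile


def parse_json_lines(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny sample file so the sketch runs without the real data set.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write('{"reviewText": "Works as expected."}\n')
    handle.write('{"reviewText": "Too quiet for my taste."}\n')

sample_reviews = list(parse_json_lines(sample_path))
print(sample_reviews[0]["reviewText"])
```

Whether this applies depends on how the data file was written: lines that are Python dict literals rather than JSON would need a literal parser instead.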

Test output: the first review text in the list.

In [4]:
df[0]["reviewText"]

"Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,"

## 2. Carrying out the Data Preprocessing Steps

### 2.1 Contractions

Expands English contractions such as "you've", "it's", "don't" into their written-out forms.

Input: any text as a string:
"It's a nice day."

Output: text as a string:
"It is a nice day."


In [5]:
def replace_contractions(text):
    return contractions.fix(text)

In [6]:
text = "Hello, it's a very nice day."
print(replace_contractions(text))

Hello, it is a very nice day.
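
To make the idea behind this step concrete, the following sketch expands contractions with a lookup table and one regular expression. It is not how the `contractions` library is implemented; the mapping `CONTRACTIONS` and the function `expand_contractions` are hypothetical and cover only three cases.

```python
import re

# Hypothetical, deliberately tiny mapping; the real library covers far more cases.
CONTRACTIONS = {
    "it's": "it is",
    "don't": "do not",
    "you've": "you have",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    def _expand(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Preserve a leading capital letter ("It's" -> "It is").
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(_expand, text)


expanded = expand_contractions("It's a nice day, but you've seen that I don't care.")
print(expanded)
```

A plain lookup like this cannot resolve ambiguous forms such as "it's" meaning "it has", which is one reason to rely on a maintained library for real data.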


### 2.2 Tokenize

Splits a text into a list whose elements are its individual words and punctuation marks.

Input: any text as a string:
"Hello, it's a very nice day."

Output: list with words and punctuation marks as list elements:
['Hello', ',', 'it', "'s", 'a', 'very', 'nice', 'day', '.']

In [7]:
def tokenize_words(text):
    return nltk.word_tokenize(text)

In [8]:
text2 = "Hello, it's a very nice day."
print(tokenize_words(text2))

['Hello', ',', 'it', "'s", 'a', 'very', 'nice', 'day', '.']
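
For comparison, a rough tokenizer can be sketched with a single regular expression from the standard library. This is a simplification and not NLTK's algorithm: `simple_tokenize` is a hypothetical helper, and unlike `nltk.word_tokenize` it would split "it's" into three tokens, so the example uses text whose contractions are already expanded.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches runs of letters, digits and underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello, it is a very nice day.")
print(tokens)
```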


### 2.3 Remove Non-ASCII

Removes all non-ASCII characters from a list of tokenized words. A token that consists only of non-ASCII characters becomes an empty string.

Input: list of tokenized words: ['Hello', '我', 'it', "'s", '我', 'very', 'nice', 'day', '.']

Output: list with ASCII characters only: ['Hello', '', 'it', "'s", '', 'very', 'nice', 'day', '.']
    

In [9]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

In [10]:
liste1 = ['Hello', '我', 'it', "'s", '我', 'very', 'nice', 'day', '.']
print(remove_non_ascii(liste1))

['Hello', '', 'it', "'s", '', 'very', 'nice', 'day', '.']
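
The NFKD normalization used above does more than delete characters: it first decomposes accented letters into a base letter plus a combining mark, so the base letter survives the ASCII encoding. The sketch below shows this on a few sample words; `ascii_fold` is a hypothetical helper name, and the final filter for empty strings is an optional extra step.

```python
import unicodedata


def ascii_fold(word):
    """Decompose accented characters, then drop every character outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")


words = ["café", "naïve", "我", "day"]
folded = [ascii_fold(word) for word in words]
# Tokens without any ASCII counterpart collapse to '', so filter them out.
cleaned = [word for word in folded if word]
print(folded)
print(cleaned)
```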


### 2.4 Convert to Lowercase and Further Steps

In [11]:

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

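def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    # Build the stop word set once instead of reloading the list for every word.
    stop_words = set(stopwords.words('english'))
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words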

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize(text):
    """Run the full preprocessing pipeline on a raw text string and return a list of tokens"""
    text = replace_contractions(text)
    words = tokenize_words(text)
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    #words = lemmatize_verbs(words)
    
    return words
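
The combined effect of lowercasing, punctuation removal and stop word removal can be tried out without NLTK. The sketch below is a simplified stand-in for those three steps only: `normalize_tokens` and the tiny `STOP_WORDS` set are hypothetical, whereas the notebook uses NLTK's English stop word list.

```python
import re

# Hypothetical, tiny stop word set used only for this sketch;
# the notebook uses NLTK's English list instead.
STOP_WORDS = {"it", "is", "a", "very"}


def normalize_tokens(tokens):
    """Lowercase tokens, strip punctuation, and drop stop words."""
    result = []
    for token in tokens:
        token = re.sub(r"[^\w\s]", "", token.lower())
        # Tokens that were pure punctuation are now empty and get skipped.
        if token and token not in STOP_WORDS:
            result.append(token)
    return result


normalized = normalize_tokens(["Hello", ",", "it", "is", "a", "very", "nice", "day", "."])
print(normalized)
```

Note that dropping empty tokens here also cleans up the empty strings that the non-ASCII step can leave behind.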


In [22]:
#liste = []

#for i in range(5):
   # var = replace_contractions(df[i]["reviewText"])
   # var = nltk.word_tokenize(var)
   # var = normalize(var)
   # liste.append(var)

#print(liste[2])

In [23]:
liste = []

In [28]:
for review in df:
    liste.append(normalize(review["reviewText"]))

10266
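
The loop above raises a `KeyError` if a record has no `"reviewText"` key. Whether that can happen in this data set is not shown here, so the following is only a defensive sketch under that assumption. `normalize_all` and `fake_normalize` are hypothetical names; the latter stands in for the notebook's `normalize` so the example runs on its own.

```python
def normalize_all(reviews, normalize_fn, field="reviewText"):
    """Apply normalize_fn to the given field of every review; missing text becomes []."""
    results = []
    for review in reviews:
        text = review.get(field, "")
        results.append(normalize_fn(text) if text else [])
    return results


# Stand-in for the notebook's normalize(), so the sketch runs on its own.
def fake_normalize(text):
    return text.lower().split()


sample = [{"reviewText": "Great filter"}, {"overall": 5.0}]
normalized_reviews = normalize_all(sample, fake_normalize)
print(normalized_reviews)
```

Keeping an empty list for records without text preserves the one-to-one alignment between `df` and the result list.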
