# Preprocessing raw Dutch tweets


**This preprocessing was done before the tweets were annotated for training, so its output serves both as material for the annotators and as system input. The system architecture requires additional preprocessing, which is presented in the other two notebooks containing the CNN-BiLSTM and the baseline SVM.**

The preprocessing here involves the following: 

1. Keeping only Dutch tweets.

2. Keeping only the tweets that are longer than 25 characters.

3. Removing the author of the tweet.

4. Removing hyperlinks and other unwanted characters.


**1. Import necessary modules**


In [2]:
import csv
import re
import pandas as pd
import numpy as np 
from langdetect import detect


**2. Split CSV file into desired columns**

In [8]:
filename = 'example_raw_tweets.csv'  # or any other tab-separated file containing raw tweets in this format


def read_csv(filename, encoding='utf-8'):
    """
    Reads a tab-separated file with csv.reader and returns four parallel lists:
    authors, dates, parties and tweet contents.
    (Make sure the file contains all the correct columns and rows, in the
    order AUTEUR, DATUM, PARTIJ, CONTENT.)
    """
    authorlist = []
    datelist = []
    partylist = []
    contentlist = []
    # Open with the requested encoding; newline='' is what the csv module recommends.
    with open(filename, 'r', encoding=encoding, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        for row in reader:
            author, date, party, content = row[0], row[1], row[2], row[3]
            # Skip header rows.
            if (author == 'AUTEUR' or date == 'DATUM'
                    or party == 'PARTIJ' or content == 'CONTENT"'):
                continue
            authorlist.append(author)
            datelist.append(date)
            partylist.append(party)
            contentlist.append(content)

    return authorlist, datelist, partylist, contentlist


authorlist, datelist, partylist, contentlist = read_csv(filename)
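Looking columns up by header name is one way to avoid the positional indexing and the explicit header checks. The following is a minimal sketch using `csv.DictReader` from the standard library. It assumes the first row of the file holds the column names AUTEUR, DATUM, PARTIJ and CONTENT (taken from the header checks in `read_csv`); the function name `read_tweets_by_header` and the sample rows are hypothetical.

```python
import csv
import io


def read_tweets_by_header(handle):
    """Return (authors, dates, parties, contents) from a tab-separated file object.

    Assumes the first row is a header with the columns
    AUTEUR, DATUM, PARTIJ and CONTENT.
    """
    authors, dates, parties, contents = [], [], [], []
    # DictReader consumes the header row itself, so no manual skipping is needed.
    for row in csv.DictReader(handle, delimiter='\t'):
        authors.append(row['AUTEUR'])
        dates.append(row['DATUM'])
        parties.append(row['PARTIJ'])
        contents.append(row['CONTENT'])
    return authors, dates, parties, contents


# Hypothetical two-row sample standing in for a real export.
sample = io.StringIO(
    "AUTEUR\tDATUM\tPARTIJ\tCONTENT\n"
    "auteur_a\t2019-01-01\tpartij_x\teerste voorbeeldtweet\n"
    "auteur_b\t2019-01-02\tpartij_y\ttweede voorbeeldtweet\n"
)
authors, dates, parties, contents = read_tweets_by_header(sample)
print(contents)
```

If the header of a real export is spelled differently (for example with a stray quote character), the keys used in the lookups have to be adjusted to match.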


**3. Keep only the tweets that are written in Dutch and are longer than 25 characters**

In [9]:
def get_only_dutch(author, date, party, content):
    """
    Using the langdetect package, keeps only the tweets detected as Dutch ('nl').
    It also drops tweets of 25 characters or fewer.
    Returns four parallel lists: authors, dates, parties and tweet contents.
    """
    authors = []
    dates = []
    parties = []
    contents = []
    for a, d, p, tweet in zip(author, date, party, content):
        # Compare the detected language code directly, so that other fields
        # (author, date, party) can never be mistaken for the language.
        if detect(tweet) != 'nl':
            continue
        if len(tweet) > 25:  # keep only the tweets that are longer than 25 characters
            authors.append(a)
            dates.append(d)
            parties.append(p)
            contents.append(tweet)

    return authors, dates, parties, contents

author, date, party, content = get_only_dutch(authorlist, datelist, partylist, contentlist)
len(content)

2190
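Two properties of langdetect are worth keeping in mind here: `detect` raises an exception when a text has no usable features (an empty string, for instance), and its results can vary between runs unless a seed is set. The sketch below shows the same filter with the detector passed in as an argument, so that undetectable texts are skipped instead of stopping the run. `filter_dutch`, `fake_detect` and the sample rows are hypothetical; the stand-in detector exists only so the example runs without langdetect installed.

```python
def filter_dutch(rows, detect_fn, min_chars=25):
    """Keep rows whose text is detected as 'nl' and is longer than min_chars.

    rows is an iterable of (author, date, party, text) tuples. detect_fn maps a
    text to a language code and may raise on texts it cannot classify.
    """
    kept = []
    for author, date, party, text in rows:
        if len(text) <= min_chars:
            continue  # too short; skip before calling the detector
        try:
            language = detect_fn(text)
        except Exception:
            continue  # undetectable text is dropped instead of crashing
        if language == 'nl':
            kept.append((author, date, party, text))
    return kept


def fake_detect(text):
    """Stand-in detector for this example only; not a real language model."""
    if not text.strip():
        raise ValueError("no features in text")
    return 'nl' if ' de ' in f' {text.lower()} ' else 'en'


rows = [
    ("a", "2019-01-01", "x", "De euro strompelt van crisis naar crisis."),
    ("b", "2019-01-02", "y", "Kort de tweet"),
    ("c", "2019-01-03", "z", "This one is written in English, not Dutch."),
    ("d", "2019-01-04", "z", " " * 30),
]
dutch_rows = filter_dutch(rows, fake_detect)
print(dutch_rows)

# With langdetect installed, the real detector can be plugged in instead:
#   from langdetect import detect, DetectorFactory
#   DetectorFactory.seed = 0  # makes repeated runs give the same result
#   dutch_rows = filter_dutch(rows, detect)
```

Checking the length first also saves a detector call for every tweet that would be discarded anyway.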

**4. More preprocessing using regular expressions**

In [10]:
def clean_tweets(content):
    """
    Using the content list returned by 'get_only_dutch', removes the author
    prefix, hyperlinks, the user that posted/retweeted and literal unicode
    escape sequences, to leave behind pure content only.
    Module to use: regex, 'import re'
    """
    cleaned = []
    for item in content:
        # Remove everything up to and including the first colon (the author prefix).
        item = re.sub(r'^.*?:', '', item)
        # Remove hyperlinks.
        item = re.sub(r'http\S+', '', item)
        # Remove the users that posted/retweeted ('@user:'), but not the users
        # mentioned inside the tweet. Using r'@\S+' instead would remove all users.
        item = re.sub(r'@[^:]*:', '', item)
        # Remove literal unicode escape sequences such as '\u2066'.
        item = re.sub(r'(\\u[0-9A-Fa-f]+)', '', item)
        cleaned.append(item)

    return cleaned

cleaned = clean_tweets(content)


In [12]:
print(cleaned[:5])

[' RT  ‘Laat de superrijken gewoon hun belastingen betalen . #tegenlicht ', ' RT  De politiek in het bijzonder #D66 #GL @COCNederland kijken weg hebben het druk met Polen en Hongarije te bekritiseren. #homogeweld ', ' RT  De euro strompelt van crisis naar crisis. De huidige situatie is onhoudbaar en benadeelt zowel Noord-Europa als Zuid-Europa. Het is tijd om de euro te ontvlechten. #FVD  via @fvdemocratie', ' Op uitnodiging van  Over superrijken die geen belasting betalen, dus niet bijdragen maar wel profiteren van de samenleving en vervolgens mooi sier maken met kruimeltjes liefdadigheid. Waarom pikken wij dit nog? ', ' RT  Een succesvolle maatschappij is een vooruitgangsmachine, maar die machine is kapot - schrijver Anand Giridharadas #tegenlicht ']
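
To see what each substitution in `clean_tweets` contributes, the same four patterns can be applied one at a time to a single string. This is a minimal sketch: the raw tweet is invented for illustration (the raw input format `author: RT @user: text` is an assumption), and `clean_steps` is a hypothetical helper.

```python
import re

# The four patterns used in clean_tweets, in the same order.
STEPS = [
    ("prefix up to first colon", r'^.*?:'),
    ("hyperlinks", r'http\S+'),
    ("posting/retweeted user", r'@[^:]*:'),
    ("literal unicode escapes", r'(\\u[0-9A-Fa-f]+)'),
]


def clean_steps(tweet):
    """Apply each pattern in turn and return the intermediate results."""
    results = []
    for name, pattern in STEPS:
        tweet = re.sub(pattern, '', tweet)
        results.append((name, tweet))
    return results


# Invented example; the raw format is an assumption.
raw = r"PartijX: RT @user: Dit is een test \u2066 https://t.co/abc #tag @mention"
for name, text in clean_steps(raw):
    print(f"{name:26} {text!r}")

final = clean_steps(raw)[-1][1]
# Optional extra step, not part of clean_tweets: collapse repeated whitespace.
normalized = " ".join(final.split())
print(repr(normalized))
```

The removals leave double spaces behind, which matches the ` RT  ` prefixes visible in the output above. Note also that `@[^:]*:` runs from an `@` to the next colon, so a mention that is followed by a colon later in the tweet is removed together with the text in between.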
