# Data Mining, Preparation and Understanding
Today we'll go through Data Mining, Preparation & Understanding, which is a really fun (and important) topic.  
In this notebook we'll try out some important libs and learn how to scrape Twitter with some help from `Twint`. All in all we'll go through `pandas`, `twint` and some more. Let's start by installing them.

In [0]:
%%capture
!pip install twint
!pip install wordcloud
import twint
import pandas as pd
import tqdm
import nltk
nltk.download('stopwords')

## Tonight's theme: ÅF Pöyry (and perhaps some AFRY)
To be a Data Miner we need something to mine.
![alt text](https://financesmarti.com/wp-content/uploads/2018/09/Dogcoin-Mining-min.jpg)
In this case it won't be Doge Coin but rather ÅF, ÅF Pöyry & AFRY.

To be honest, it's not the best theme (the names get pretty generic once you go ASCII, which we'll do to simplify our lives).
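
To make "going ASCII" concrete, here is a minimal sketch using only the standard library. The helper name `to_ascii` is made up for this example and is not part of Twint; it decomposes accented characters and drops whatever is not plain ASCII.

```python
import unicodedata

def to_ascii(text):
    """Fold accented characters to their closest ASCII letters."""
    # NFKD splits e.g. "Å" into "A" plus a combining ring
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops everything that is not ASCII
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("ÅF Pöyry"))  # AF Poyry
```

After folding, `ÅF` is just `AF`, which is why the search terms become so generic.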

### What is Twint
`Twint` is a really helpful library for scraping Twitter. It uses the search (i.e. not the API) and simplifies the whole process for us as users.  
The other way to do this would be to either use the API yourself (time-consuming to learn and also limited in calls) or use BS4 (Beautiful Soup), which is a great Python lib for scraping websites. But I'd dare say that it is better suited for static content sites such as Wikipedia or Aftonbladet than for Twitter.  
All of this led to the choice of `Twint`, _even_ though it has a **huge** disadvantage: it does not support UTF-8 from what I can find.


### What is pandas
Pandas is a library to parse, understand and work with data. It's really fast thanks to the `DataFrame` it supplies.  
Using this `DataFrame` we can manipulate the data in different ways. It covers most of the operations you know from SQL and Excel (filtering, grouping, joining, pivoting), a great tool all in all.
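
As a small taste of the SQL/Excel-style operations, here is a hedged sketch on made-up rows (the column names and values are invented for illustration and are not Twint output):

```python
import pandas as pd

# Made-up rows standing in for scraped tweets
df = pd.DataFrame({
    "username": ["anna", "bo", "anna", "cecilia"],
    "tweet": ["hej", "god morgon", "tack", "hej hej"],
    "likes": [3, 0, 5, 1],
})

# SQL-style WHERE: keep rows with at least one like
liked = df[df["likes"] >= 1]

# SQL-style GROUP BY (or an Excel pivot): total likes per user
likes_per_user = df.groupby("username")["likes"].sum()

print(liked.shape)
print(likes_per_user)
```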

### Bringing it all together
Let's take a look at how we can use this all together!

First a quick look at the Twint config.

In [0]:
"""
Twint Config:

Variable             Type       Description
--------------------------------------------
Retweets             (bool)   - Display replies to a subject.
Search               (string) - Search terms
Store_csv            (bool)   - Set to True to write as a csv file.
Pandas               (bool)   - Enable Pandas integration.
Store_pandas         (bool)   - Save Tweets in a DataFrame (Pandas) file.
Get_replies          (bool)   - All replies to the tweet.
Lang                 (string) - Compatible language codes: https://github.com/twintproject/twint/wiki/Langauge-codes (sv, fi & en supported)
Format               (string) - Custom terminal output formatting.
Hide_output          (bool)   - Hide output.

Rest of config: https://github.com/twintproject/twint/wiki/Configuration
"""

In [0]:
c = twint.Config()
c.Search = " ÅF "
c.Format = "Username: {username} |  Tweet: {tweet}"
c.Pandas = True
c.Store_pandas = True
c.Pandas_clean = True
c.Show_hashtags = True
c.Limit = 10


In [0]:
twint.run.Search(c)

**What do we see?**
No Swedish whatsoever. This is not interesting for our use case, as all the tweets are really about something else.  
Let's try combining ÅF with AFRY and Pöyry instead.

In [0]:
c.Search = "ÅF AFRY Pöyry"
twint.run.Search(c)

Looking at this we have a much better result. This really shows the power of n-grams (e.g. bigrams): several words together carry far more context than a single one.  
Let's play around some in the next box, trying `@AFkarriar` as the keyword and also including `Replies` and some other fields.
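
As a reminder of what an n-gram is, here is a minimal sketch in plain Python. The helper name `ngrams` and the sample sentence are made up for this example.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "ÅF Pöyry blir AFRY".split()
print(ngrams(tokens, 2))
# [('ÅF', 'Pöyry'), ('Pöyry', 'blir'), ('blir', 'AFRY')]
```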

In [0]:
c.Replies = True
twint.run.Search(c)
# Play around with params, do whatever!

### Results
Ok, so we have tried out a few different things we can use in `Twint`. For me `@AFkarriar` worked out best - **what was your favorite?**  

Let's analyze some more.

In [0]:
FILENAME = "afpoyry.csv"
c = twint.Config()
c.Show_hashtags = True
c.Search = "ÅF"
c.Lang = "sv"
#c.Get_replies = True
c.Store_csv = True
c.Hide_output = True
c.Output = FILENAME
twint.run.Search(c)

In [0]:
data = pd.read_csv(FILENAME)
print(data.shape)
print(data.dtypes)

### Cleaning
We can most likely drop some columns here, just to make it simpler for us.

In [0]:
data_less = data.filter(["tweet", "username"])
data_less.head()

In [0]:
data_less["tweet"].head()

In [0]:
from wordcloud import WordCloud
from IPython.display import Image

t = '\n'.join(data_less["tweet"])

WordCloud().generate(t).to_file('cloud.png')
Image('cloud.png')

**Stop Words** - Anyone remember? Let's remove them!  
NLTK is a great toolkit for just about everything in NLP, and it ships stop word lists for many languages, including Swedish.

In [0]:
from nltk.corpus import stopwords
swe_stop = set(stopwords.words('swedish'))
list(swe_stop)[:5]

**Stemming** - Anyone remember? Let's do it!  
NLTK is _the_ lib to use when you want at least _some_ Swedish. But I think I've squeezed all the Swedish out of NLTK that I can find right now...

In [0]:
from nltk.stem import SnowballStemmer
 
stemmer = SnowballStemmer("swedish")
stemmer.stem("hoppade")

**Cleaning** - Anyone remember? Let's do it!  
![alt text](https://imgflip.com/s/meme/X-All-The-Y.jpg)  
To have a "better" word cloud we need to reduce the dimensions and keep more important words.

In [0]:
%%capture
!pip install regex

In [0]:
from string import punctuation
import regex as re

# bad_words = re.compile("https|http|pic|www|och|med|att|åf|pöyry|läs")
# Raw strings keep the backslashes intact for the regex engine
http_re = re.compile(r"https?.*?(\w+)\.\w+(\/\s)?")
whitespace_re = re.compile(r"\s+")
punc_set = set(punctuation)

def clean_punct(tweet):
    """Drop every ASCII punctuation character."""
    return ''.join([c for c in tweet if c not in punc_set])

def remove_stopwords(tweet):
    """Drop Swedish stop words (expects lowercased text)."""
    return " ".join([t for t in tweet.split(" ") if t not in swe_stop])

# Example of cleaning: lowercase, drop pic links, shorten URLs,
# remove punctuation and stop words (we want to reduce the space/dimensions)
def clean_text(tweet):
    tweet = tweet.lower()
    # Drop picture links such as pic.twitter.com/...
    tweet = ' '.join([word for word in tweet.split() if not word.startswith('pic.')])
    # Collapse the URL prefix to the first name captured by http_re
    tweet = http_re.sub(r'\1', tweet)
    tweet = remove_stopwords(clean_punct(tweet)).strip()
    tweet = whitespace_re.sub(' ', tweet)
    return tweet

clean_text("hej där borta. hur mår du? vem vet.. Jag vet  inte. http:/google.com pic.twitterlol")
#data_less["tweet"] = data_less["tweet"].apply(lambda x: clean_text(x))

In [0]:
data_less["tweet"]

In [0]:
from wordcloud import WordCloud
from IPython.display import Image

t = '\n'.join(data_less["tweet"])

WordCloud().generate(t).to_file('cloud_clean.png')
Image('cloud_clean.png')

In [0]:
from collections import Counter

def print_most_common(wcount, n=5):
  for (name, count) in wcount.most_common(n):
    print(f"{name}: {count}")

In [0]:
t_hash = ' '.join([x for x in t.split() if x.startswith("#")])
hash_count = Counter(t_hash.split())
WordCloud().generate(t_hash).to_file('cloud_#.png')

print_most_common(hash_count, 10)
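
The `startswith("#")` check keeps trailing punctuation, so `#AFRY,` and `#AFRY` are counted as different tags. One possible alternative, sketched with the standard `re` module on a made-up string (`extract_tags` is a hypothetical helper, not something used elsewhere in this notebook):

```python
import re
from collections import Counter

def extract_tags(text, prefix="#"):
    """Find tokens such as #tag or @user, lowercased, without trailing punctuation."""
    pattern = re.compile(re.escape(prefix) + r"\w+")
    return [tag.lower() for tag in pattern.findall(text)]

sample = "Hej #AFRY, vi ses på #afry! Hälsningar @AFkarriar #ÅF"
tag_count = Counter(extract_tags(sample))
print(tag_count.most_common(2))  # [('#afry', 2), ('#åf', 1)]
```

Passing `prefix="@"` gives the mentions instead. Lowercasing merges `#AFRY` and `#afry`; skip it if the casing matters to you.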

In [0]:
t_at = ' '.join([x for x in t.split() if x.startswith("@")])
at_count = Counter(t_at.split())
WordCloud().generate(t_at).to_file('cloud_@.png')

print_most_common(at_count, 10)

### WordClouds!
Let's take a look at what we've got.

In [0]:
Image('cloud_clean.png')

In [0]:
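# Note: 'cloud_no_stop.png' is never written earlier in this notebook, so
# generate it first (for example from tweets with stop words removed).
Image('cloud_no_stop.png')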

In [0]:
Image('cloud_@.png')

In [0]:
Image('cloud_#.png')

### What to do?
A big problem with Swedish is that there are very few models we can have some fun with, and our time is very limited.  
Further on we can do the following:


1.   Look at n-grams to see if we can find common patterns
2.   ...
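
Point 1 could start from something like the sketch below, which counts the most frequent bigrams over a list of tweets. The tweets here are made up; in the notebook you would pass in the cleaned tweet column instead, and `bigram_counts` is a hypothetical helper name.

```python
from collections import Counter

def bigram_counts(tweets):
    """Count pairs of adjacent words over a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        words = tweet.split()
        # zip pairs each word with its successor
        counts.update(zip(words, words[1:]))
    return counts

tweets = [
    "åf pöyry byter namn",
    "åf pöyry blir afry",
    "afry söker ingenjörer",
]
print(bigram_counts(tweets).most_common(1))  # [(('åf', 'pöyry'), 2)]
```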


In [0]:
"""
1. Try perhaps some type of Ngrams
2. Find other interesting things
3. Try to find connections
4. Move over to Spark (?)
https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f
https://medium.com/@kennycontreras/natural-language-processing-using-spark-and-pandas-f60a5eb1cfc6
"""

### AFRY
Let's create a wordcloud & everything for AFRY. This is for you to implement fully!

In [0]:
FILENAME2 = "afry.csv"
c = twint.Config()
c.Show_hashtags = True
c.Search = "afry"
c.Lang = "sv"
c.Get_replies = True
c.Store_csv = True
c.Hide_output = True
c.Output = FILENAME2
twint.run.Search(c)

In [0]:
data_afry = pd.read_csv(FILENAME2)
t_afry = '\n'.join(data_afry["tweet"])
WordCloud().generate(t_afry).to_file('cloud_afry.png')

Image('cloud_afry.png')

### Jonas Sjöstedt (jsjostedt) vs Jimmie Åkesson (jimmieakesson)
Implementation as follows:
1. Get data for both (tip: use `c.Username` or `c.User_id` and don't forget formatting output in terminal if used)
2. Clean data
3. ?? (Perhaps wordclouds etc)
4. TfIdf
5. Join ds & shuffle, train clf
6. Testing!

## Jimmie Åkesson

In [0]:
FILENAME = "jimmie2.csv"
c = twint.Config()
c.Show_hashtags = True
#c.Search = "ÅF"
c.Username = "jimmieakesson"
#c.Get_replies = True
c.Store_csv = True
c.Output = FILENAME
twint.run.Search(c)

In [0]:
data_jimmie = pd.read_csv(FILENAME)
print(data_jimmie.shape)

data_less_jimmie = data_jimmie.filter(["tweet", "username"])
data_less_jimmie.head()

In [0]:
data_less_jimmie["tweet"] = data_less_jimmie["tweet"].apply(lambda x: clean_text(x))
data_less_jimmie.head()

In [0]:
from wordcloud import WordCloud
from IPython.display import Image

t = '\n'.join(data_less_jimmie["tweet"])

WordCloud().generate(t).to_file('cloud_clean_jimmie.png')
Image('cloud_clean_jimmie.png')

## Jonas Sjöstedt

In [0]:
FILENAME_J = "jonas.csv"
c = twint.Config()
c.Show_hashtags = True
#c.Search = "ÅF"
c.Username = "jsjostedt"
#c.Get_replies = True
c.Store_csv = True
c.Hide_output = True
c.Output = FILENAME_J
twint.run.Search(c)


In [0]:
data_jonas = pd.read_csv(FILENAME_J)
print(data_jonas.shape)

data_less_jonas = data_jonas.filter(["tweet", "username"])
data_less_jonas.head()

data_less_jonas["tweet"] = data_less_jonas["tweet"].apply(lambda x: clean_text(x))
data_less_jonas.head()

t = '\n'.join(data_less_jonas["tweet"])

WordCloud().generate(t).to_file('cloud_clean_jonas.png')
Image('cloud_clean_jonas.png')

# TfIdf
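
TF-IDF weighs a word up when it is frequent in one tweet and down when it appears in many tweets. The sketch below computes it by hand on made-up documents, using the smoothed IDF that scikit-learn's `TfidfVectorizer` applies by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. It splits on whitespace only and ignores scikit-learn's tokenisation rules, so treat it as an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalise so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = ["skatt jobb jobb", "skatt vård", "skatt skola"]
vectors = tfidf(docs)
print(vectors[1])
```

In the second document, `vård` gets a higher weight than `skatt`, because `skatt` occurs in every document and therefore says little about any single one.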

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
cv=TfidfVectorizer(ngram_range=(1,1))

In [0]:
word_count_vector_jonas = cv.fit_transform(data_less_jonas["tweet"])

In [0]:
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

#get tfidf vector for first document
first_document_vector=word_count_vector_jonas[0]
 
#print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

In [0]:
word_count_vector_jimmie = cv.fit_transform(data_less_jimmie["tweet"])

In [0]:
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

#get tfidf vector for the third document (index 2)
first_document_vector=word_count_vector_jimmie[2]
 
#print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

# Join dfs & shuffle, train clf

In [0]:
print(data_jimmie.shape)
print(data_jonas.shape)

In [0]:
from sklearn.utils import shuffle
tfidf = TfidfVectorizer(ngram_range=(1,2))

# 2581 is hard-coded, presumably to match the number of Jimmie tweets so the classes are balanced
data_less_jonas = data_less_jonas.head(2581)
print(data_less_jonas.shape)

combined = pd.concat([data_less_jimmie,data_less_jonas])
combined = shuffle(combined)
print(combined.shape)
combined.head()

In [0]:
from sklearn.model_selection import train_test_split

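# Note: the vectorizer is fitted on all tweets before the split, so the test set
# influences the vocabulary and IDF weights; fine for a demo, optimistic for evaluation
tweet_tfidf = tfidf.fit_transform(combined["tweet"])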
X_train, X_test, y_train, y_test = train_test_split(tweet_tfidf, combined["username"], test_size=0.1, random_state=42)
X_train[:3]

In [0]:
from sklearn.svm import LinearSVC

clf = LinearSVC()

model = clf.fit(X_train, y_train)

In [0]:
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
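
`classification_report` prints precision, recall and F1 per class. As a reminder of what those numbers mean, here is a hand-rolled sketch for a single label on made-up predictions. `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["jsjostedt", "jsjostedt", "jimmieakesson", "jimmieakesson"]
y_pred = ["jsjostedt", "jimmieakesson", "jimmieakesson", "jimmieakesson"]
print(precision_recall_f1(y_true, y_pred, "jimmieakesson"))
```

Here every real `jimmieakesson` tweet is found (recall 1.0), but one of the three tweets predicted as `jimmieakesson` is wrong (precision 2/3).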

# Testing!

In [0]:
def testClassifier(tweet):
  vector = tfidf.transform([clean_text(tweet)])

  print(model.predict(vector))

In [0]:
testClassifier("")

In [0]:
testClassifier("Arbetsmarknaden är inte fri svenska kollektivavtal privatisering arbetslösa kommun")

# Going forward
I see 4 options:


1.   Find things that can help people in the office (@AFRY)
2.   Create models for Swedish and perhaps open-source them
3.   Make "interesting"/"fun" things (such as applying text generation to something like Cards Against Humanity)
4.   Try something new (perhaps Image Recognition?)

Focusing on Swedish is only possible in 1 & 2.

Some concrete options:
* Explore SparkNLP
* Ask around at AFRY for things to automate
* Apply SOTA text generation to generate something like Cards Against Humanity cards or some person's tweets.
* Create datasets to train Swedish models on (might need Mechanical Turk; this will be pretty time-consuming before we see any type of results).
* Something completely different.
