# iX article "Beziehungssache"
by Stefanie Scholz and Christian Winkler

## Prerequisites
Unfortunately, we cannot show every processing step needed to produce the figure in this notebook, because that would require the complete data set of the subreddit.

## Topic Models

The full data set for the topic models is too large to be offered as a download. Instead, this notebook computes a topic model for the titles of the top-level posts.

### Note: the results therefore differ from those in the article!
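
As a rough orientation for the cells below, here is a minimal sketch of the same technique on a hypothetical toy corpus: NMF factorises the TF-IDF matrix into a document-topic matrix `W` and a topic-term matrix `H`, both non-negative. The titles and all `toy_` names are made up for illustration; `init="nndsvd"` and `random_state=0` are assumptions chosen only to make the toy run repeatable.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles, not taken from the subreddit data
toy_titles = [
    "machine learning model training",
    "deep learning model for images",
    "training a machine learning model",
    "startup funding and venture capital",
    "venture capital funding round",
    "startup raises funding from investors",
]

toy_tfidf = TfidfVectorizer()
toy_vectors = toy_tfidf.fit_transform(toy_titles)

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(toy_vectors)  # one row per title, one column per topic
H = toy_nmf.components_                 # one row per topic, one column per term

print(W.shape, H.shape)
```

The largest entries in a row of `H` identify the terms that characterise a topic, which is what the word clouds and tables further down visualise.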

In [None]:
import pandas as pd
all_posts = pd.read_csv("https://github.com/datanizing/ix-reddit/raw/main/all-toplevel-posts.csv.gz", parse_dates=["created_utc"])

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

# Copy spaCy's stop words (so the library's global set stays untouched)
# and extend them with Reddit "slang"
stop_words = set(STOP_WORDS)
# split() without an argument ignores runs of whitespace, so no empty strings are added
stop_words.update("""amp at blog body buy buycheap call
            can case change cheap co com could
            create delete download drive email first fix
            fuck go good help how http https
            just late look make market message more
            need new news now number online oral
            page pass post question reddit remove review
            say search self send should site support
            test text time top unlock use video
            watch way why will work""".split())

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Recent scikit-learn versions expect stop_words as a list, not a set
tfidf = TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=5, max_features=10000, stop_words=list(stop_words))
tfidf_vectors = tfidf.fit_transform(all_posts["title"])

In [None]:
from sklearn.decomposition import NMF

num_topics = 20

# Fixed seed so that repeated runs yield the same topics
nmf = NMF(n_components=num_topics, random_state=42)
nmf.fit(tfidf_vectors)

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def wordcloud_topic_model_summary(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        freq = {}
        for i in topic.argsort()[:-no_top_words - 1:-1]:
            val = int(100000.0 * topic[i])
            freq[feature_names[i].replace(" ", "_")] = val+1
        wc = WordCloud(background_color="white", max_words=100, width=960, height=540)
        wc.generate_from_frequencies(freq)
        plt.figure(figsize=(12,12))
        plt.imshow(wc, interpolation='bilinear')
        plt.axis("off");
            
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        first_index = topic.argsort()[-1]
        print("Topic %s (%02d):" % (feature_names[first_index], topic_idx))
        print(" ".join(["'"+feature_names[i]+"'"
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
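
The slice `argsort()[:-no_top_words - 1:-1]` used in both functions above picks the indices of the largest weights in descending order. A minimal sketch with made-up weights (the array `toy_weights` is hypothetical):

```python
import numpy as np

toy_weights = np.array([0.1, 0.5, 0.3, 0.9])  # made-up topic weights for four terms
n_top = 2

# argsort() sorts ascending; the reversed slice walks back from the end
# and stops after n_top entries
top_ids = toy_weights.argsort()[:-n_top - 1:-1]
print(top_ids)
```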

In [None]:
def topics_table(model, feature_names, n_top_words = 20):
    
    # Build a DataFrame for display
    word_dict = {}
    num_topics = model.n_components
    
    for i in range(num_topics):
        
        # determine the largest weights for each topic
        # and add the corresponding words in plain text to the dictionary
        words_ids = model.components_[i].argsort()[:-n_top_words-1:-1]
        words = [feature_names[key] for key in words_ids]
        word_dict['Topic #%2d' % i] = words
    
    display(pd.DataFrame(word_dict))

In [None]:
wordcloud_topic_model_summary(nmf, tfidf.get_feature_names_out(), 40)

In [None]:
topics_table(nmf, tfidf.get_feature_names_out())

In [None]:
all_posts["month"] = all_posts["created_utc"].dt.strftime("%Y-%m")

In [None]:
from tqdm.auto import tqdm
import numpy as np
month_data = []
for month in tqdm(np.unique(all_posts["month"])):
    W_month = nmf.transform(tfidf_vectors[np.array(all_posts["month"] == month)])
    month_data.append([month] + list(W_month.sum(axis=0)/W_month.sum()*100.0))
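
The expression `W_month.sum(axis=0)/W_month.sum()*100.0` in the loop above turns the topic weights of one month into percentage shares that add up to 100. A small sketch with a hypothetical weight matrix `toy_W`:

```python
import numpy as np

# Hypothetical document-topic weights: 2 documents, 2 topics
toy_W = np.array([[1.0, 1.0],
                  [1.0, 3.0]])

# Weight per topic, divided by the total weight of all documents, in percent
toy_share = toy_W.sum(axis=0) / toy_W.sum() * 100.0
print(toy_share)
```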

In [None]:
topic_names = []
voc = tfidf.get_feature_names_out()
for topic in nmf.components_:
    important = topic.argsort()
    top_word = voc[important[-1]] + " " + voc[important[-2]]
    topic_names.append("Topic " + top_word)

In [None]:
df_month = pd.DataFrame(month_data, columns=["month"] + topic_names).set_index("month")
df_month.plot.area(figsize=(16,9))

## Classification

First, a set of positive and negative samples is needed. In the original Python code, this looks as follows:

```python
pos = pd.read_sql("SELECT created_utc, nav AS title FROM toplevel_posts2020 p, nlp_posts np\
                   WHERE np.id=p.id AND (flair='AI' OR flair='Artificial Intelligence') AND \
                         created_utc>='2015-05-01'", sql, parse_dates=["created_utc"])
pos["target"] = 1

neg = pd.read_sql("SELECT created_utc, nav AS title FROM toplevel_posts2020 p, nlp_posts np\
                   WHERE np.id=p.id AND flair!='AI' AND flair!='Artificial Intelligence' AND flair IS NOT NULL AND \
                         created_utc>='2015-05-01'", sql, parse_dates=["created_utc"])
neg["target"] = 0

data = pd.concat([pos, neg.sample(n = len(pos), random_state=42)], 
                 ignore_index=True)
```

Here we load the `DataFrame` directly instead:

In [None]:
import pandas as pd
data = pd.read_csv("https://github.com/datanizing/ix-reddit/raw/main/classification-data.csv.gz")

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Recent scikit-learn versions expect stop_words as a list, not a set
cv = CountVectorizer(ngram_range=(1,2), max_df=0.7, min_df=5, stop_words=list(stop_words))
count_vectors = cv.fit_transform(data["title"])

In [None]:
count_vectors.shape

Compute the TF-IDF vectors.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True)
tfidf_vectors = tfidf.fit_transform(count_vectors)
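
The two-step combination of `CountVectorizer` and `TfidfTransformer` used above produces the same matrix as a single `TfidfVectorizer` with matching parameters, while keeping the raw counts available. A minimal sketch on hypothetical toy titles (default parameters are assumed on both sides; all `toy_` names are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy titles
toy_titles = [
    "machine learning model training",
    "venture capital funding round",
    "training a machine learning model",
]

# Two steps: raw term counts first, then TF-IDF weighting
toy_counts = CountVectorizer().fit_transform(toy_titles)
two_step = TfidfTransformer().fit_transform(toy_counts)

# One step: TfidfVectorizer combines both
one_step = TfidfVectorizer().fit_transform(toy_titles)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```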

In [None]:
X = tfidf_vectors
Y = data["target"].values

In [None]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)
clf.fit(X, Y)

In [None]:
Y_predicted = clf.predict(X)

In [None]:
from sklearn import metrics
metrics.accuracy_score(Y, Y_predicted)

In [None]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(Y, Y_predicted)
conf_mat
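
In scikit-learn's convention, the rows of the confusion matrix are the actual classes and the columns the predicted ones, so accuracy, precision and recall can be derived from its four cells. A sketch with made-up labels (`toy_true` and `toy_pred` are hypothetical):

```python
from sklearn.metrics import confusion_matrix

toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(accuracy, precision, recall)
```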

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(6, 4))
category_names = ["negative", "positive"]
sns.heatmap(conf_mat, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=category_names, yticklabels=category_names)
plt.ylabel("Actual")
plt.xlabel("Predicted");

### Hold-out method: separate sets for training and testing

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
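
With `test_size=0.25`, `train_test_split` shuffles the rows and holds back a quarter of them for testing. The call above does not stratify; as an optional variation (an assumption, not part of the original notebook), the `stratify` parameter keeps the class ratio identical in both parts. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced toy data: 100 negatives, 100 positives
toy_X = np.arange(200).reshape(-1, 1)
toy_y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)

print(X_tr.shape, X_te.shape, y_te.mean())
```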

In [None]:
# A cell only displays its last expression, so print all three shapes
print(X.shape, X_train.shape, X_test.shape)

Train the model on the training data only.

In [None]:
clf.fit(X_train, Y_train)

Determine the performance on the training data itself.

In [None]:
Y_predicted = clf.predict(X_train)

metrics.accuracy_score(Y_train, Y_predicted)

Determine the performance on the test data.

In [None]:
Y_predicted = clf.predict(X_test)

metrics.accuracy_score(Y_test, Y_predicted)

In [None]:
print(metrics.classification_report(Y_test, Y_predicted, target_names=category_names))
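
A single hold-out split gives only one estimate, and the vectorizers above were fitted on all titles before the split, so the test titles already influenced vocabulary and IDF weights. One possible alternative, sketched here under assumptions and not taken from the article, wraps vectorizer and classifier in a pipeline and evaluates it with 5-fold cross-validation, so that each fold fits the vectorizer on its training part only. The toy texts and all `toy_` names are synthetic.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic toy titles: 10 "AI" examples and 10 others
toy_pos = [f"neural network model number {i}" for i in range(10)]
toy_neg = [f"startup funding round number {i}" for i in range(10)]
toy_texts = toy_pos + toy_neg
toy_labels = [1] * 10 + [0] * 10

# The pipeline refits the vectorizer inside every cross-validation fold
toy_pipeline = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss="hinge", random_state=42),
)

toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=5)
print(toy_scores.mean())
```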

## Spot checks

In [None]:
Y_pred_all = clf.predict(tfidf.transform(cv.transform(data["title"])))

In [None]:
data["pred"] = Y_pred_all

In [None]:
data[data["pred"] != data["target"]][["title", "target", "pred"]]

## Prediction

In [None]:
all_posts["ai"] = clf.predict(tfidf.transform(cv.transform(all_posts["title"])))

In [None]:
all_posts["ai"].describe()

In [None]:
all_posts_m = all_posts.dropna(subset=["created_utc"]).set_index("created_utc").resample("M").agg({ "ai": "sum", "title": "count"})

In [None]:
all_posts_m["rel"] = all_posts_m["ai"] / all_posts_m["title"]

In [None]:
all_posts_m[["rel"]].plot(figsize=(16,9))
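
The aggregation above counts per month how many posts were classified as AI-related (`sum` of the 0/1 column) and how many posts there are in total (`count`); their ratio is the plotted share. A sketch on four made-up posts, using the month-start alias `"MS"` as an assumption to stay independent of the pandas version:

```python
import pandas as pd

# Four made-up posts in two months
toy_posts = pd.DataFrame({
    "created_utc": pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-10", "2020-02-11"]),
    "title": ["a", "b", "c", "d"],
    "ai": [1, 0, 1, 1],
})

toy_m = (toy_posts.set_index("created_utc")
                  .resample("MS")
                  .agg({"ai": "sum", "title": "count"}))
toy_m["rel"] = toy_m["ai"] / toy_m["title"]
print(toy_m)
```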

## Trend forecast

Originally, the selection was done with the following command:

```python
df = pd.read_sql("SELECT STRFTIME('%Y-%m-01', created_utc) AS month, flair, COUNT(*) AS count \
                  FROM toplevel_posts2020 \
                  WHERE created_utc>='2014-01-01' AND flair IN (SELECT flair FROM flairs WHERE count>1000) \
                  GROUP BY flair, month", sql, parse_dates=["month"])
```

Because the underlying data is too large here as well, we load the `DataFrame` directly:

In [None]:
df = pd.read_csv("https://github.com/datanizing/ix-reddit/raw/main/flairs-per-month.csv.gz", parse_dates=["month"])

In [None]:
past = df.pivot(index="flair", columns="month", values="count").fillna(0)
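
`pivot` reshapes the long table (one row per flair and month) into a wide one with one row per flair and one column per month; combinations without posts become `NaN` and are then set to 0. A sketch with made-up counts (the `toy_` frames are hypothetical):

```python
import pandas as pd

# Long format: flair "B" has no entry for February
toy_long = pd.DataFrame({
    "month": ["2020-01-01", "2020-02-01", "2020-01-01"],
    "flair": ["A", "A", "B"],
    "count": [3, 5, 2],
})

toy_wide = toy_long.pivot(index="flair", columns="month", values="count").fillna(0)
print(toy_wide)
```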

In [None]:
past

In [None]:
!pip install prophet

In [None]:
from prophet import Prophet

In [None]:
pa = pd.DataFrame()
pa["ds"] = past.columns
pa["y"] = past.loc["Business"].values
pa

In [None]:
m = Prophet()
m.fit(pa)

In [None]:
# Month-start frequency, matching the '%Y-%m-01' month stamps of the input data
future = m.make_future_dataframe(periods=20, freq='MS')

In [None]:
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
fig1 = m.plot(forecast)

In [None]:
fig2 = m.plot_components(forecast)