<h1>Data visualization</h1>

This notebook visualizes the scraped data to make it easier to understand. In the long run, this helps us track down data problems (such as null entries) and form hypotheses about why certain algorithms or embeddings are less useful than they first seemed.

First, we load all of the data from the database.

In [None]:
import sqlite3
import pandas as pd

pd.set_option('display.max_colwidth', 200)

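def df_from_database(database_path):
    """Load the complete word_embeddings table into a DataFrame."""
    con = sqlite3.connect(database_path)
    try:
        # Close the connection even if the query fails.
        return pd.read_sql_query("SELECT * FROM word_embeddings", con)
    finally:
        con.close()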

df = df_from_database("html.sqlite")

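# Reconstruct each page's URL from its file name: drop the local prefix and
# the ".html" suffix, and turn the ">" separators back into "/".
df["url"] = [
    "https://www.th-nuernberg.de/"
    + file.replace("data/htmlfiles/file_", "").replace(">", "/").replace(".html", "")
    for file in df["filename"]
]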

print(df.dtypes)
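
The `url` column is rebuilt from the scraper's file names. As a minimal sketch, assuming file names follow the pattern implied by the code above (a `data/htmlfiles/file_` prefix, `>` in place of `/`, and a `.html` suffix), the same mapping can be written as a small helper and checked on one file name. The helper `filename_to_url` and the sample file name are hypothetical and not part of the original notebook. Unlike the chained `replace` calls, this variant only strips the prefix and suffix at the ends of the string.

```python
BASE_URL = "https://www.th-nuernberg.de/"
PREFIX = "data/htmlfiles/file_"
SUFFIX = ".html"


def filename_to_url(filename: str) -> str:
    """Map a scraped file name back to the URL it presumably came from."""
    path = filename
    # Strip prefix and suffix only where they occur at the ends,
    # so a ".html" in the middle of a path is left untouched.
    if path.startswith(PREFIX):
        path = path[len(PREFIX):]
    if path.endswith(SUFFIX):
        path = path[:-len(SUFFIX)]
    # Assumption: ">" in the file name stands for "/" in the URL path.
    return BASE_URL + path.replace(">", "/")


# Hypothetical file name, used for illustration only.
example = "data/htmlfiles/file_fakultaeten>bw>fakultaet.html"
print(filename_to_url(example))
```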

In [None]:
print(f"We scraped a total of {len(df)} different web pages")
print("Two sample rows from the data:")
print(df.sample(2))

The following plots show how long the text on these web pages typically is.

In [None]:
import matplotlib.pyplot as plt

df['length'] = df['text'].map(len)
df['length'].plot(kind='box', vert=False, figsize=(8, 1))
plt.show()
# Histogram of the text length, restricted to pages that contain any text.
df.loc[df["length"] > 0, "length"].plot(kind='hist', bins=50)
plt.show()
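
Summary statistics complement the plots by giving the typical length as numbers. The following is a minimal sketch on a small made-up DataFrame (the texts are placeholders, not scraped data). It reports mean and median separately for all pages and for non-empty pages, since a large number of empty pages pulls the average down.

```python
import pandas as pd

# Placeholder texts standing in for the scraped "text" column.
toy = pd.DataFrame({"text": ["", "abc", "abcdefghi", "", "abcdef"]})
toy["length"] = toy["text"].str.len()

# Pages that contain at least one character.
non_empty = toy.loc[toy["length"] > 0, "length"]

summary = {
    "mean_all": toy["length"].mean(),
    "median_all": toy["length"].median(),
    "mean_non_empty": non_empty.mean(),
    "median_non_empty": non_empty.median(),
}
print(summary)
```

On the real data, the same four values can be computed by replacing `toy` with `df`.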

In [None]:
most_text = df.sort_values("length", ascending=False)
print(most_text["url"].head(10))

In [None]:
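# Histogram of the non-empty pages, skipping the 100 longest ones so the bulk of the distribution is visible.
most_text.loc[most_text["length"] > 0, "length"].iloc[100:].plot(kind='hist', bins=200)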
plt.show()


TODO: store the links to the web pages in the DataFrame
The page with the most content (https://www.th-nuernberg.de/fakultaeten/bw/fakultaet/alumni/alumniportraits/) is indeed a very informative page that presents portraits of alumni of the Faculty of Business Administration (BWL).

Pages with no content make up roughly one third of all pages.

In [None]:
df[df["length"].eq(0)].count()
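
To check the "roughly one third" estimate directly, the share of empty pages can be computed as the mean of a boolean mask. A minimal sketch on made-up lengths (the numbers are illustrative, not taken from the scraped data):

```python
import pandas as pd

# Made-up text lengths; 0 marks a page without any text.
toy = pd.DataFrame({"length": [0, 120, 0, 45, 300, 10]})

is_empty = toy["length"].eq(0)
n_empty = int(is_empty.sum())   # number of pages without text
share_empty = is_empty.mean()   # mean of booleans = fraction of True values

print(f"{n_empty} of {len(toy)} pages are empty ({share_empty:.1%})")
```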


In [None]:
least_text = df.sort_values("length", ascending=True).head(10)
print(least_text["url"])

Word cloud with all words
TODO: filter out the most frequent words, lemmatize, handle German and English

In [None]:
from matplotlib import pyplot as plt
from wordcloud import WordCloud
from matplotlib.colors import ListedColormap
from collections import Counter


def txa_wordcloud(list_of_words, bg_color="black", mask=None):
    # Define custom colors for the color map
    colors = ["#FF0000", "#FFCD00", "#FFD800", "#00FF00"]

    # Create a custom color map using ListedColormap
    colormap = ListedColormap(colors)

    wc = WordCloud(width=800, height=400, background_color=bg_color, colormap=colormap, max_words=200, mask=mask) 
    wc.generate_from_frequencies(Counter(list_of_words))

    # Plot
    fig = plt.figure(figsize=(20, 10))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

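whole_text = " ".join(df["text"]).lower()
# split() without arguments splits on any whitespace (spaces and newlines)
# and drops empty strings, so words are not glued together across line breaks.
whole_text_list = whole_text.split()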
txa_wordcloud(whole_text_list)



In [None]:
file_path = 'data/wordlist.txt'
common_words = set()  # a set keeps the membership test below fast
try:
    # Assumes the word list is UTF-8 encoded, one word per line.
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            word = line.strip().lower()  # the page text was lowercased as well
            if word:
                common_words.add(word)
except FileNotFoundError:
    print(f"The file '{file_path}' was not found.")
except Exception as e:
    print(f"An error occurred: {str(e)}")

print(common_words)
filtered_text_list = [word for word in whole_text_list if word not in common_words]
txa_wordcloud(filtered_text_list)
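
A word cloud only conveys relative sizes; a plain frequency table makes the same information checkable. The following is a minimal sketch using `collections.Counter`, with a made-up word list and a made-up stop-word set standing in for `whole_text_list` and `common_words`. The helper `top_words` is hypothetical.

```python
from collections import Counter


def top_words(words, stop_words, n=3):
    """Return the n most frequent words that are not stop words."""
    counts = Counter(w for w in words if w and w not in stop_words)
    return counts.most_common(n)


# Made-up sample data, for illustration only.
sample_words = "die fakultät und die hochschule und die fakultät".split()
sample_stop_words = {"die", "und"}

print(top_words(sample_words, sample_stop_words))
```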

TODO: count the pages with non-HTML content (xml, jpg, pdf ...)

In [None]:
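# Select the rows whose file name contains "xml" (use ~mask for the inverse selection).
mask = df["filename"].str.contains("xml", regex=False)
xml_text = df.loc[mask]
print(xml_text["url"])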
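
One possible way to approach the TODO above is to derive a content type from the file name and count the rows per type. This is only a sketch under a strong assumption that the notebook does not confirm: that the original extension (for example `.pdf`) survives in the file name directly before the `.html` suffix. The file names and the `content_type` column are made up for illustration.

```python
import pandas as pd

# Hypothetical file names; the real naming scheme may differ.
toy = pd.DataFrame({"filename": [
    "data/htmlfiles/file_studium.html",
    "data/htmlfiles/file_sitemap.xml.html",
    "data/htmlfiles/file_flyer.pdf.html",
    "data/htmlfiles/file_logo.jpg.html",
    "data/htmlfiles/file_feed.xml.html",
]})

# Capture a known non-HTML extension that directly precedes the ".html" suffix;
# rows without such an extension are treated as regular HTML pages.
pattern = r"\.(xml|jpg|pdf)\.html$"
toy["content_type"] = toy["filename"].str.extract(pattern, expand=False).fillna("html")

type_counts = toy["content_type"].value_counts()
print(type_counts)
```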