# 🌐 Wikipedia recommender system
- Zuzanna Gawrysiak 148255
- Agata Żywot 148258

### Description (straight from ekursy - to be deleted later)


The general task is to create a system that will recommend similar articles based on the previously visited articles.

**Input - Collection of articles (links or titles), Output - Collection of recommended articles (links or titles) with a "score"**


You will receive a grade for each of the following steps. The highest possible score without
finishing all parts is 4.0. For example, if you complete the first two steps perfectly, your grades
will be 4.0, 4.0, 2.0.


**Crawling and scraping** - Download text from at least 1000 Wikipedia/fandom wiki articles.
(Scrapy is not required)


**Stemming, lemmatization** - preprocess downloaded documents into the most suitable form for this
task. Store it as a .csv/parquet file or into a database.


**Similarities** - for a given collection of previously visited articles find the best matches in your
database and recommend them to the user


GUI not required, notebook or any other reasonable form will be accepted. I have to be able to
provide a list of articles in an easy way and receive a meaningful recommendation.
You have to send the source code and report.


Report:
- pdf or notebook
- explain each step of your algorithm, especially how you score articles
- present interesting statistics about your database (most frequent words, histograms, similarities
between documents, ...)
- show some examples of recommendations with explanations (I'd prefer graphical form - see
prediction breakdowns for example)

## Import necessary libraries

In [None]:
%pip install pyldavis
%pip install wikipedia

In [None]:
# SCRAPING
import random
import linecache
import wikipedia
import re
import json

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine
from wordcloud import WordCloud

import pyLDAvis
import pyLDAvis.sklearn  # note: this module is named pyLDAvis.lda_model in pyLDAvis 3.4.0 and later

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, wordpunct_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

## Scraping wikipedia articles

In [None]:
N = 100
with open(r'./data/titles.txt', 'r') as fp:
    num_lines = sum(1 for line in fp)
    # print('Total lines:', num_lines) 

In [None]:
# get N random article titles (linecache line numbers are 1-based, so include the last line)
random.seed(2137)
line_numbers = random.sample(range(1, num_lines + 1), N)

titles = []
for i in line_numbers:
    x = linecache.getline(r'./data/titles.txt', i).strip()
    titles.append(x)
print(titles[:5])

In [None]:
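def save_titles_content(titles):
    """
    Download each page and save its title, URL, body, links and images as a JSON file
    """
    for title in titles:
        try:
            page = wikipedia.page(title)
        except Exception as exc:  # e.g. missing page, disambiguation page, network error
            print(f'Could not fetch page {title!r}: {exc}')
            continue
        d = dict()
        d['title'] = page.title
        d['url'] = page.url
        # strip section headings such as "== History =="
        d['body'] = re.sub(r'==.*?==+', '', page.content)
        d['links'] = page.links
        try:
            d['images'] = page.images
        except Exception:  # the image lookup fails for some pages
            d['images'] = []  # keep the same type (list) as a successful lookup

        json_object = json.dumps(d, indent=2)
        # build the file name: whitespace -> underscores, drop punctuation and path separators
        file_title = re.sub(r'\s', '_', page.title)
        file_title = re.sub(r'[,./\\-]', '', file_title)

        with open(f'./data/pages_content/{file_title}.json', 'w') as outfile:
            outfile.write(json_object)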

In [None]:
save_titles_content(titles)

## Stemming, lemmatization

In [None]:
def preprocess(article):
    """
    Tokenize given article, lowercase it, remove stopwords and non-alphabetic tokens, then perform stemming
    """
    preprocessed = []
    porter = PorterStemmer()
    tokenized = word_tokenize(article)
    sw = set(stopwords.words('english'))  # a set makes membership tests fast

    for word in tokenized:
        word = word.lower()  # NLTK stopwords are lowercase, so "The" must become "the" to be removed
        if word.isalpha() and word not in sw:
            # stemming is faster than lemmatization but less accurate (can try both later)
            preprocessed.append(porter.stem(word))
    return ' '.join(preprocessed)
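
The cells below use an `articles` data frame with `body` and `URL` columns, but this notebook does not show how it is built. A minimal sketch of one way to load the saved JSON files follows. The helper name `load_article_records` and its `transform` parameter are hypothetical, and renaming the JSON key `url` to `URL` is an assumption made only to match the column name used later.

```python
import json
from pathlib import Path


def load_article_records(directory, transform=None):
    """Read every *.json file in `directory` into a list of dicts.

    `transform` is an optional callable applied to each article body
    (for example the notebook's `preprocess` function).
    """
    records = []
    for path in sorted(Path(directory).glob('*.json')):
        with open(path, encoding='utf-8') as fp:
            page = json.load(fp)
        body = page['body']
        if transform is not None:
            body = transform(body)
        # 'URL' (upper case) is an assumption matching the column used in later cells
        records.append({'title': page['title'], 'URL': page['url'], 'body': body})
    return records
```

Under these assumptions, `articles = pd.DataFrame(load_article_records('./data/pages_content', transform=preprocess))` would give a frame with the columns the later cells expect.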

### Count vector
Store each article as the number of occurrences of each word.

In [None]:
CountVec = CountVectorizer(ngram_range=(1,1), stop_words='english')
CountData = CountVec.fit_transform(articles.body)
 
CountData
# if dataset is too large, try: https://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick 

In [None]:
dfCV = pd.DataFrame(CountData.toarray(), columns=CountVec.get_feature_names_out(), index=articles.URL)
dfCV
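
To make the structure of the count matrix concrete, here is a small dependency-free sketch of the same idea on two toy documents. It is a simplification: it splits on whitespace only, whereas `CountVectorizer` also lowercases, drops English stop words and ignores one-character tokens. The function name `count_matrix` is hypothetical.

```python
from collections import Counter


def count_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] is how often vocabulary[j] occurs in documents[i]."""
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


vocabulary, rows = count_matrix(['cat sat mat', 'cat cat dog'])
print(vocabulary)  # ['cat', 'dog', 'mat', 'sat']
print(rows)        # [[1, 0, 1, 1], [2, 1, 0, 0]]
```

Each row corresponds to one document and each column to one vocabulary term, which is the layout of `dfCV` above.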

## Database analysis
> present interesting statistics about your database (most frequent words, histograms, similarities between documents, ...)

### Most frequent words

In [None]:
word_sums = dfCV.sum(axis=0)
word_sums = word_sums.sort_values(ascending=False)
print(f"Top five most frequent words:\n{word_sums[:5]}")
word_sums[:10].plot(kind='bar', figsize=(12,8), title="Most frequent words", color='hotpink')

The most frequent words (up to 150) as a word cloud

In [None]:
def generate_wordcloud(data):
    wc = WordCloud(width=1200, height=800, max_words=150, background_color='white', colormap='magma').generate_from_frequencies(data)
    plt.figure(figsize=(12,8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()

generate_wordcloud(word_sums)

### Similarities between documents

Check the similarities using LDA (Latent Dirichlet Allocation).

In [None]:
lda_tf = LatentDirichletAllocation(n_components=3, random_state=0) # n_components is the number of topics
lda_tf.fit(CountData)

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_tf, CountData, CountVec, mds='tsne')
panel

### Conclusions from analysis
* the most frequent word is ...
* bla bla

## TF-IDF approach
The count vectors were used to show some interesting statistics; for the recommendations we will use TF-IDF.

In [None]:
tfidf = TfidfVectorizer(ngram_range=(1,1), use_idf=True, smooth_idf=False, stop_words='english') 
tfidf_data = tfidf.fit_transform(articles.body) 
dfTFIDF = pd.DataFrame(tfidf_data.toarray(), index=articles.URL, columns=tfidf.get_feature_names_out())
dfTFIDF
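
For the report, it may help to spell out what these weights are. With `smooth_idf=False` and scikit-learn's default L2 normalisation, a term's weight is its raw count times `idf(t) = ln(n / df(t)) + 1`, and each document row is then scaled to unit length. The sketch below reproduces that formula on toy data using only the standard library. It tokenizes by whitespace, which is a simplification of what `TfidfVectorizer` does, and the name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """TF-IDF with idf(t) = ln(n / df(t)) + 1 and L2-normalised rows."""
    tokenized = [doc.split() for doc in documents]
    n = len(tokenized)
    vocabulary = sorted({term for doc in tokenized for term in doc})
    # document frequency: the number of documents in which each term occurs
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n / df[term]) + 1 for term in vocabulary}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


vocabulary, rows = tfidf_rows(['a b', 'a c'])
# 'a' occurs in both documents, so within each row it gets a lower weight than 'b' or 'c'
print(vocabulary, rows)
```

Terms that occur in every document keep an idf of 1 rather than 0, so they are down-weighted but not removed.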

Save the `dfTFIDF` data frame to a CSV file.

In [None]:
dfTFIDF.to_csv('articles.csv')

## Similarities

In [None]:
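query = "poznań"
query = preprocess(query)
query_vec = tfidf.transform([query]).toarray()[0]
# scipy's cosine() returns a distance, so similarity = 1 - distance
similarities = 1 - dfTFIDF.apply(lambda row: cosine(row, query_vec), axis=1)
similarities.sort_values(ascending=False)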
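
The task asks for recommendations based on a collection of previously visited articles rather than a single text query. One possible approach, sketched below and not necessarily the one this project will end up using, is to average the TF-IDF vectors of the visited articles into a profile and rank the remaining articles by cosine similarity to it. The code uses plain lists so it runs without dependencies. The names `recommend` and `term_contributions`, and the idea of explaining a score through per-term products, are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(vectors, visited, top_n=5):
    """Rank unvisited articles by similarity to the mean vector of the visited ones.

    `vectors` maps an article id (e.g. its URL) to its TF-IDF vector.
    """
    if not visited:
        return []
    # user profile = component-wise mean of the visited articles' vectors
    profile = [sum(col) / len(visited) for col in zip(*(vectors[a] for a in visited))]
    scores = {
        article: cosine_similarity(vec, profile)
        for article, vec in vectors.items()
        if article not in visited
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


def term_contributions(vocabulary, u, v, top_n=3):
    """Terms with the largest positive products u_i * v_i (a simple score breakdown)."""
    products = [(term, a * b) for term, a, b in zip(vocabulary, u, v)]
    positive = [p for p in products if p[1] > 0]
    return sorted(positive, key=lambda p: p[1], reverse=True)[:top_n]
```

With the notebook's data, `vectors` could be built from the rows of `dfTFIDF` and `vocabulary` from its columns. Returning 0.0 for all-zero vectors avoids the NaN that a query with no known words would otherwise produce.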