# Part 1

In this part, I downloaded a set of Wikipedia articles, cleaned their contents, and stored them in PostgreSQL tables. I used three mini-libraries frequently in this part:
- download_from_wikipedia
- cleaner
- database_manager

The first step is to select the category whose articles we want to download.

In [5]:
category = 'machine learning'

## Downloading articles

Next, we use the `get_articles` function from the `download_from_wikipedia` library to download the articles under `machine learning`. A category can have sub-categories, so we can also download the articles under the sub-categories of machine learning, and we can set how deep the sub-category traversal goes.

I want to download around 1,000 articles under machine learning, so I have to find a suitable `sub_cat_level`. **After the category and sub-category depth are passed to `get_articles`, the function reports how many articles it found and asks whether to continue with the download or change the depth.**

Since downloading 1,000 articles takes some time, my function reports its progress after every 200 downloaded articles.
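The depth-limited walk over sub-categories described above can be sketched roughly as follows. This is not the code of `get_articles`; `collect_titles`, `fetch_members`, and the toy category tree are hypothetical names introduced only for illustration, and depth 0 is assumed to mean "the category itself, no sub-categories". In a real setting `fetch_members` would wrap a Wikipedia API query; here it is injected so the traversal can run offline.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical callback: category name -> (article titles, sub-category names).
FetchMembers = Callable[[str], Tuple[List[str], List[str]]]


def collect_titles(category: str, sub_cat_level: int, fetch_members: FetchMembers) -> List[str]:
    """Collect article titles under `category`, descending at most `sub_cat_level` levels."""
    seen_categories: Set[str] = {category}
    titles: Dict[str, None] = {}  # dict used as an insertion-ordered set
    frontier = [category]
    for depth in range(sub_cat_level + 1):
        next_frontier = []
        for cat in frontier:
            articles, subcats = fetch_members(cat)
            for title in articles:
                titles.setdefault(title, None)  # an article can sit in several categories
            if depth < sub_cat_level:
                for sub in subcats:
                    if sub not in seen_categories:  # category graphs can contain cycles
                        seen_categories.add(sub)
                        next_frontier.append(sub)
        frontier = next_frontier
    return list(titles)


# Toy data standing in for Wikipedia (assumption, for illustration only).
toy_tree = {
    "machine learning": (["Perceptron"], ["neural networks"]),
    "neural networks": (["Backpropagation", "Perceptron"], ["deep learning"]),
    "deep learning": (["Transformer"], ["machine learning"]),
}


def fetch_from_toy_tree(cat: str) -> Tuple[List[str], List[str]]:
    return toy_tree.get(cat, ([], []))


for level in range(3):
    found = collect_titles("machine learning", level, fetch_from_toy_tree)
    print("depth", level, "->", len(found), "articles")
```

Counting the titles first, as in the loop at the end, is what makes it possible to pick a depth before committing to a long download.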

In [6]:
from lib.download_from_wikipedia import get_articles

In [8]:
wiki_df = get_articles(category, sub_cat_level=2)

There are 1082 articles under category "machine learning" (subcategory depth = 2)
Enter "y" to download all articles. Otherwise, enter "n" and change the category and/or subcategory depth.
y
downloaded 200 articles so far...
downloaded 400 articles so far...
downloaded 600 articles so far...
downloaded 800 articles so far...
downloaded 1000 articles so far...
1082 articles downloaded successfully!


In [10]:
wiki_df.shape

(1082, 5)

## Text cleaning

Now that we have a dataframe with the contents of all the machine learning articles, we need to clean the text. I used the `cleaner` library for this purpose.

Some Wikipedia pages have no content, and some are User/Template/File pages, so we first have to get rid of those articles.

In [12]:
from lib.cleaner import df_cleaner, text_cleaner, title_cleaner
wiki_df = df_cleaner(wiki_df)
wiki_df.shape

(1074, 5)
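The filtering step that removed 8 rows above could look something like the sketch below. It is an assumption about the kind of rule `df_cleaner` applies, not its actual code: `is_real_article` and `NON_ARTICLE_PREFIXES` are hypothetical names, and the real library may test different prefixes or columns.

```python
# Assumed namespace prefixes; the original library may filter a different set.
NON_ARTICLE_PREFIXES = ("User:", "Template:", "File:")


def is_real_article(title, content) -> bool:
    """True when the page has non-blank text and is not a User/Template/File page."""
    has_content = isinstance(content, str) and content.strip() != ""
    return has_content and not title.startswith(NON_ARTICLE_PREFIXES)


# Small hand-made sample (illustrative only).
pages = [
    ("Perceptron", "The perceptron is a linear classifier."),
    ("User:Example/sandbox", "Draft notes."),
    ("Template:Machine learning", "Navigation box."),
    ("Empty page", "   "),
]
kept = [title for title, content in pages if is_real_article(title, content)]
print(kept)
```

On a dataframe, the same predicate could be applied as a boolean mask, for example `wiki_df[[is_real_article(t, c) for t, c in zip(wiki_df['title'], wiki_df['content'])]]`.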

After that, we will lemmatize the contents:

In [13]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [14]:
lemmatized_content_list = [' '.join([word.lemma_ for word in nlp(content)]) for content in wiki_df['content'].tolist()]

Now we call the `text_cleaner` function to clean the lemmatized contents. I used RegExr to get rid of formulas (LaTeX), whitespace, numbers, and non-letter characters. We also need to clean the titles before writing them to Postgres. In Postgres a single quote delimits a string literal, and many of the titles contain one, so each single quote in a title has to be doubled before the title can be embedded in an `INSERT` statement.

In [15]:
clean_content_list = [text_cleaner(content) for content in lemmatized_content_list]

wiki_df['clean_content'] = clean_content_list
wiki_df['title'] = wiki_df['title'].apply(title_cleaner)
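For readers without access to the `cleaner` library, here is a minimal sketch of the two cleaning steps described above. The function names `clean_text` and `escape_title` and the regular expressions are assumptions made for illustration; the patterns in the real `text_cleaner` and `title_cleaner` are not shown in this notebook and may differ, in particular in how they handle LaTeX.

```python
import re


def clean_text(text: str) -> str:
    """Drop LaTeX commands, digits, punctuation, and extra whitespace."""
    text = re.sub(r"\\[A-Za-z]+", " ", text)  # LaTeX commands such as \displaystyle
    text = re.sub(r"[^A-Za-z]+", " ", text)   # every run of non-letters becomes one space
    return text.strip()


def escape_title(title: str) -> str:
    """Double single quotes so the title can sit inside a SQL string literal."""
    return title.replace("'", "''")


print(clean_text("The loss {\\displaystyle L=2x} be 42  small."))
print(escape_title("Bayes' theorem"))
```

Note that this simple version keeps the letters that were inside a formula (such as `L` and `x`); removing whole formulas would need a pattern tailored to how the downloaded text marks them.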

## Writing to database

I used my `database_manager` library to create a schema with three tables and store the Wikipedia articles in them. The tables are:
- **`articles`**: Primary key is page ID. Includes title and cleaned content for each page ID.
- **`categories`**: Primary key is category ID. Includes category names (or titles).
- **`article_category`**: Primary key is page (article) ID. Includes the category ID for each article.
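The three-table layout listed above might be declared roughly as follows. This sketch uses Python's built-in `sqlite3` with an in-memory database so that it runs without a Postgres server; it is not the DDL used by `create_schema`. Only the `article_id` column of `articles` appears in this notebook; every other column name, the types, and the foreign keys are assumptions.

```python
import sqlite3

# Assumed DDL; column names other than articles.article_id are hypothetical.
DDL = """
CREATE TABLE IF NOT EXISTS articles (
    article_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    content    TEXT
);
CREATE TABLE IF NOT EXISTS categories (
    category_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS article_category (
    article_id  INTEGER PRIMARY KEY REFERENCES articles (article_id),
    category_id INTEGER NOT NULL REFERENCES categories (category_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.executescript(DDL)  # IF NOT EXISTS makes a second run harmless

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Because the article ID is the primary key of `article_category`, this layout allows each article to be linked to only one category.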

In [16]:
from lib.database_manager import create_schema, query_to_dataframe, insert_to_db

In [17]:
create_schema()

Schema already exists!


### Writing to `articles` table

In [18]:
articles_id = query_to_dataframe('SELECT article_id FROM articles')

if len(articles_id) > 0:
    # Skip articles that are already stored in the table
    wiki_df = wiki_df[~wiki_df['pageid'].isin(articles_id['article_id'])]

articles_to_insert = ', '.join(["({}, '{}', '{}')".format(r['pageid'], r['title'], r['clean_content'])
                                for i, r in wiki_df.iterrows()])

if articles_to_insert != '':
    insert_to_db('INSERT INTO articles VALUES {}'.format(articles_to_insert))
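As an alternative to building the `VALUES` string by hand, database drivers can bind values as parameters, which removes the need to double single quotes in titles. The sketch below shows the idea with Python's built-in `sqlite3` and `?` placeholders so that it runs without a server; with `psycopg2` the placeholder would be `%s`. Whether `insert_to_db` can accept parameters is not known from this notebook, and the table and sample rows here are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.execute("INSERT INTO articles VALUES (?, ?, ?)", (1, "Perceptron", "row stored earlier"))

# Candidate rows; the second title contains an unescaped single quote.
rows = [
    (1, "Perceptron", "already stored"),
    (2, "Bayes' theorem", "title with a single quote"),
]

# Same idea as the cell above: skip IDs that are already in the table.
existing = {row[0] for row in conn.execute("SELECT article_id FROM articles")}
new_rows = [row for row in rows if row[0] not in existing]

# The driver binds each value, so no manual quoting is needed.
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", new_rows)

stored = conn.execute("SELECT article_id, title FROM articles ORDER BY article_id").fetchall()
print(stored)
```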

### Writing to `categories` table

In [19]:
import wikipedia

category_id_to_insert = wikipedia.WikipediaPage(category).pageid
category_title_to_insert = title_cleaner(category)

try:
    insert_to_db("INSERT INTO categories VALUES ({}, '{}')".format(category_id_to_insert, category_title_to_insert))
except Exception:
    # Most likely the category row already exists, so the failed insert is skipped
    pass

### Writing to `article_category` table

In [20]:
articles_cats_to_insert = ', '.join(["({}, {})".format(r['pageid'], category_id_to_insert)\
                                     for i, r in wiki_df.iterrows()])

if articles_cats_to_insert != '':
    insert_to_db('INSERT INTO article_category VALUES {}'.format(articles_cats_to_insert))