# Predict tags on StackOverflow with linear models

In this assignment you will learn how to predict tags for posts from [StackOverflow](https://stackoverflow.com). To solve this task you will use a multilabel classification approach.
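
To make "multilabel" concrete: each post may carry several tags at once, so a target is usually represented as a 0/1 vector with one position per tag. The snippet below is a minimal sketch on made-up toy data; the names `posts_tags`, `all_tags` and `to_indicator` are hypothetical and are not part of the assignment code.

```python
# Hypothetical toy data: each post may carry any number of tags.
posts_tags = [["r"], ["php", "mysql"], ["javascript", "jquery"]]

# Fix a sorted vocabulary of every tag seen in the toy data.
all_tags = sorted({tag for tags in posts_tags for tag in tags})

def to_indicator(tags, vocabulary):
    """Return a 0/1 list with one position per tag in the vocabulary."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]

indicator_rows = [to_indicator(tags, all_tags) for tags in posts_tags]

print(all_tags)
for row in indicator_rows:
    print(row)
```

A multilabel classifier then predicts every position of such a vector, rather than a single class per post.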

### Libraries

In this task you will need the following libraries:
- [Numpy](http://www.numpy.org) — a package for scientific computing.
- [Pandas](https://pandas.pydata.org) — a library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- [scikit-learn](http://scikit-learn.org/stable/index.html) — a tool for data mining and data analysis.
- [NLTK](http://www.nltk.org) — a platform to work with natural language.

### Download the training, validation and test data (do this if the week1/data folder is empty)

In [1]:
import sys
sys.path.append("..")
from common.download_utils import download_week1_resources

download_week1_resources()

### Text preprocessing
For this and most of the following assignments you will need to use a list of stop words. It can be downloaded from nltk:

In [2]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anupam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In this task you will deal with a dataset of post titles from StackOverflow. The data is split into three sets: *train*, *validation* and *test*. All corpora except *test* contain the titles of the posts and the corresponding tags (100 tags are available). The *test* set is provided for Coursera's grading and doesn't contain answers. Load the corpora using *pandas* and look at the data:

In [4]:
import pandas as pd
import numpy as np
from ast import literal_eval

def read_data(filename):
    data = pd.read_csv(filename, sep='\t')
    data['tags'] = data['tags'].apply(literal_eval)
    return data

train_data = read_data('data/train.tsv') 
val_data = read_data('data/validation.tsv')   
test_data = pd.read_csv('data/test.tsv', sep='\t')
train_data.head()

|   | title | tags |
|---|---|---|
| 0 | How to draw a stacked dotplot in R? | [r] |
| 1 | mysql select all records where a datetime fiel... | [php, mysql] |
| 2 | How to terminate windows phone 8.1 app | [c#] |
| 3 | get current time in a specific country via jquery | [javascript, jquery] |
| 4 | Configuring Tomcat to Use SSL | [java] |


As you can see, the *title* column contains the titles of the posts and the *tags* column contains the tags. Note that the number of tags per post is not fixed; a post can have as many as necessary.
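
The `read_data` function above applies `literal_eval` to the *tags* column because a TSV file can only store the tag list as text. The sketch below illustrates that conversion on a single made-up cell; it assumes the cell holds a Python-style list literal, which is what the use of `literal_eval` suggests.

```python
from ast import literal_eval

# Assumption: a tags cell is stored as the text of a Python list literal.
raw_tags = "['php', 'mysql']"

# literal_eval safely parses the text into an actual list of strings.
parsed_tags = literal_eval(raw_tags)

print(type(raw_tags).__name__, "->", type(parsed_tags).__name__)
print(parsed_tags, len(parsed_tags))
```

Unlike `eval`, `literal_eval` only accepts literals (strings, numbers, lists, and so on), so it does not execute arbitrary code found in the file.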

For convenience, initialize *X_train*, *X_val*, *X_test*, *y_train*, *y_val*.

In [6]:
X_train = train_data['title'].values
y_train = train_data['tags'].values
X_val = val_data['title'].values
y_val = val_data['tags'].values
X_test = test_data['title'].values
print(X_train.shape, y_train.shape)

(100000,) (100000,)


One of the best-known difficulties of working with natural language data is that it is unstructured. For example, if you use it "as is" and extract tokens just by splitting the titles on whitespace, you will see many "weird" tokens like *3.5?*, *"Flip*, etc. To avoid such problems, it is usually useful to preprocess the data. In this task you'll write a function that will also be used in the other assignments.
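
The effect of naive whitespace splitting can be seen in a small sketch. The title below is made up for illustration and is not taken from the dataset; `noisy_tokens` is a hypothetical name.

```python
# Made-up title containing a stray quote and trailing punctuation.
title = 'How to "Flip an image in Python 3.5?'

# Splitting on whitespace keeps punctuation glued to the words.
tokens = title.split()

# Tokens that are not purely alphanumeric are the "weird" ones.
noisy_tokens = [token for token in tokens if not token.isalnum()]

print(tokens)
print(noisy_tokens)
```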

**Task 1 (TextPrepare).** Implement the function *text_prepare* following the instructions. After that, run the function *test_text_prepare* to test it on tiny cases and submit it to Coursera.

In [7]:
import re

# Raw strings avoid invalid-escape warnings for sequences like '\[' in newer Python versions
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
    return text.strip()


In [8]:
def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'    
    
print(test_text_prepare())

Basic tests are passed.
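
To see what each step of *text_prepare* contributes, the following sketch returns the intermediate string after every stage. It reuses the two regular expressions from the cell above, but swaps NLTK's stop-word list for a tiny hand-picked set (`TOY_STOPWORDS`, an assumption made only to keep the example self-contained); `text_prepare_steps` is a hypothetical helper, not part of the assignment.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a tiny stand-in for NLTK's English stop-word list.
TOY_STOPWORDS = {"how", "to", "a", "the", "of"}

def text_prepare_steps(text):
    """Return the intermediate string after each preprocessing step."""
    lowered = text.lower()                          # 1. lowercase
    spaced = REPLACE_BY_SPACE_RE.sub(' ', lowered)  # 2. separators -> space
    cleaned = BAD_SYMBOLS_RE.sub('', spaced)        # 3. drop other symbols
    kept = ' '.join(word for word in cleaned.split()
                    if word not in TOY_STOPWORDS)   # 4. drop stop words
    return [lowered, spaced, cleaned, kept]

steps = text_prepare_steps("SQLite/PHP read-only?")
for step in steps:
    print(repr(step))
```

Replacing `/` with a space before deleting the remaining symbols keeps `sqlite` and `php` as two separate tokens, whereas `-` is simply deleted, so `read-only` becomes `readonly`.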


Run your implementation on the questions from the file *text_prepare_tests.tsv* to earn the points.

In [14]:
# Display full text in column
pd.set_option('display.max_colwidth', None)
# Pass the separator by keyword; recent pandas versions reject it as a positional argument
df_test_text = pd.read_csv('data/text_prepare_tests.tsv', sep='\t', names=['test_text'])
df_test_text['test_text_processed'] = df_test_text['test_text'].apply(text_prepare)
df_test_text.head(10)


|   | test_text | test_text_processed |
|---|---|---|
| 0 | SQLite/PHP read-only? | sqlite php readonly |
| 1 | Creating Multiple textboxes dynamically | creating multiple textboxes dynamically |
| 2 | that, self or me — which one to prefer in JavaScript? | self one prefer javascript |
| 3 | Save PHP date string into MySQL database as timestamp | save php date string mysql database timestamp |
| 4 | How I can fill my DropDownList with Data from a XML File in my ASP.NET Application | fill dropdownlist data xml file aspnet application |
| 5 | Programmatically trigger a jQuery-UI draggable's "drag" event | programmatically trigger jqueryui draggables drag event |
| 6 | How to get the value of a method argument via reflection in Java? | get value method argument via reflection java |
| 7 | Knockout maping.fromJS for observableArray from json object. Data gets lost | knockout mapingfromjs observablearray json object data gets lost |
| 8 | Facebook Connect from Localhost, doing some weird stuff | facebook connect localhost weird stuff |
| 9 | fullcalendar prev / next click | fullcalendar prev next click |


Now we can preprocess the titles using the function *text_prepare*, making sure that the titles don't contain bad symbols:

In [15]:
X_train = np.array([text_prepare(text) for text in X_train])
X_val = np.array([text_prepare(text) for text in X_val])
X_test = np.array([text_prepare(text) for text in X_test])
X_train.shape

(100000,)

For each tag and for each word, calculate how many times it occurs in the train corpus.

**Task 2 (WordsTagsCount).** Find 3 most popular tags and 3 most popular words in the train data and submit the results to earn the points.

In [21]:
def get_word_count(data):
    word_count = {}
    for text in data:    
        for word in text.split():
            # dict.get returns 0 for a word that has not been seen yet
            word_count[word] = word_count.get(word, 0) + 1
    return word_count

train_word_freq = get_word_count(X_train)

def get_topn_dictitems_byvalue(dict_data, n):
    freq_word = [(value, key) for key, value in dict_data.items()]
    freq_word.sort(reverse=True, key=lambda k: k[0])
    return freq_word[:n]

print('3 most popular words in train data are:')
for item in get_topn_dictitems_byvalue(train_word_freq, 3):
    print(f"'{item[1]}' with a count of {item[0]}")


3 most popular words in train data are:
'using' with a count of 8278
'php' with a count of 5614
'java' with a count of 5501


In [23]:
def get_tag_count(tag_data):
    tag_count = {}
    for tags in tag_data:
        for tag in tags:
            # dict.get returns 0 for a tag that has not been seen yet
            tag_count[tag] = tag_count.get(tag, 0) + 1
    return tag_count
    
tag_freq = get_tag_count(y_train)    
print('3 most popular tags in train data are:')
for item in get_topn_dictitems_byvalue(tag_freq, 3):
    print(f"'{item[1]}' with a count of {item[0]}")                

3 most popular tags in train data are:
'javascript' with a count of 19078
'c#' with a count of 19077
'java' with a count of 18661
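
As a cross-check, the same kind of counting can be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the cell's implementation, and it runs on a made-up toy corpus (`toy_titles`, `toy_tags`) rather than on the assignment data.

```python
from collections import Counter

# Hypothetical toy corpus, not the assignment data.
toy_titles = ["php mysql query", "java string php", "php array"]
toy_tags = [["php", "mysql"], ["java"], ["php"]]

# Count every whitespace-separated word across all titles.
word_counts = Counter(word for title in toy_titles for word in title.split())

# Count every tag across all posts.
tag_counts = Counter(tag for tags in toy_tags for tag in tags)

# most_common(n) returns the n (item, count) pairs with the highest counts.
top_word = word_counts.most_common(1)
top_tag = tag_counts.most_common(1)
print(top_word)
print(top_tag)
```

On the real data, passing `X_train` and `y_train` in place of the toy lists and calling `most_common(3)` should give the same top entries as the dictionaries above, provided there are no ties at the cut-off.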
