# Fundamentals of Analyzing Text
## Characterizing the R and Python Communities



In [1]:
import pandas as pd 
from bs4 import BeautifulSoup
import re
import codecs
import requests
import wordcloud
import numpy as np 
from collections import Counter
import itertools

import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
# This pulls Snorre's explore_regex script from GitHub.
# It sets up an explorer that can be used to iteratively
# find good regular expressions.
#
# Use the module to develop the pattern, and finally use the .report() method to document your process.

with open('explore_regex.py','w') as f:
    f.write(requests.get('https://raw.githubusercontent.com/snorreralund/explore_regex/master/explore_regex.py').text)
    
import explore_regex

In [3]:
# LOAD DATA
# Load overflow data
overflow_df = pd.read_csv('Posts_overflow.csv').dropna(subset = ['Body'])
# Load the datascience data. 
datascience_df = pd.read_csv('Posts_ds.csv').dropna(subset = ['Body'])
# load stats data
stats_df = pd.read_csv('Posts_stats.csv').dropna(subset = ['Body'])

**Ex.11.1.1** _Cleaning_
As with any other data (especially text data), the first task is cleaning. Here we need to brush up on our string and scraping fundamentals: regular expressions and BeautifulSoup.

Before we apply BeautifulSoup to get the clean text from the HTML, we need to do some "pre-preprocessing".
We shall take advantage of the HTML tags to extract code from text.

Extract the code segments using the HTML tag `<code>`, and put them into a separate column for later analysis.

- Design a regular expression that matches anything inside a `<code></code>` HTML tag. 
> _Hint:_ inspect some of the data to see exactly how the code is wrapped in the HTML.
- Use it to extract the code to another column.
- Use it to remove the code from the Body column.
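
As a starting point, the two steps (extract, then remove) can be sketched on a single toy string. This is only an illustration under assumptions: the `body` string is made up, and the real posts may wrap code differently (for example with attributes on the tag), so the pattern should be checked against the data.

```python
import re

# Non-greedy match of everything between <code> and </code>.
# re.DOTALL lets '.' span newlines, since code blocks are often multi-line.
code_regex = re.compile(r'<code>(.*?)</code>', flags=re.DOTALL)

# Toy post body (made up for illustration, not taken from the dataset).
body = (
    '<p>Call <code>df.head()</code> first.</p>'
    '<pre><code>import pandas as pd\n'
    'df = pd.read_csv("x.csv")\n</code></pre>'
)

# Step 1: collect the code segments.
code_segments = code_regex.findall(body)

# Step 2: remove the code segments from the body.
body_without_code = code_regex.sub('', body)
```

On a DataFrame, one option is to run the same `findall` and `sub` calls row by row with `.apply` on the Body column.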
    
        

In [16]:
# [Answer to ex. 11.1.1]

In [17]:
# [This question will be in assignment 4]

**Ex 11.1.2** Extract comments within the code using a regular expression that matches everything between a **#** and a **\n**.

Later we shall use this to analyze properties of the comments in relation to the code (e.g. number of comments, length). Specifically, we need to define a regular expression that extracts comment text from within the code segments by matching anything after # until a newline.
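
A small sketch on a made-up code snippet shows how the choice of flags changes what "until a newline" means. By default `.` does not match `\n`, so `#.*` stops at the end of each line. With `re.DOTALL`, `.` also matches newlines and the pattern runs from the first `#` to the end of the string.

```python
import re

# Toy code segment (made up for illustration).
code = "x = 1  # set x\ny = 2\n# standalone comment\nprint(x + y)"

# Default behavior: '.' stops at '\n', so each comment is a separate match.
line_comments = re.compile(r'#.*').findall(code)

# With re.DOTALL, '.' crosses newlines: one match from the first '#' to the end.
greedy_match = re.compile(r'#.*$', flags=re.DOTALL).findall(code)
```

Note that this simple pattern also matches a `#` that appears inside a string literal, so it is an approximation of real comments.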

In [6]:
# [Answer to ex. 11.1.2]

In [7]:
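# Match from '#' up to (but not including) the next newline.
# re.DOTALL is deliberately left out: with it, '.' would also match '\n'
# and the pattern would run from the first '#' to the end of the code.
comment_str = r'#.*'
comment_regex = re.compile(comment_str, flags = re.UNICODE)

# extract_from_to_column is not defined in this notebook; it is assumed to come from the answer to ex. 11.1.1.
overflow_df = extract_from_to_column(overflow_df, comment_regex, from_col = 'code', to_col = 'comments')
datascience_df = extract_from_to_column(datascience_df, comment_regex, from_col = 'code', to_col = 'comments')
stats_df = extract_from_to_column(stats_df, comment_regex, from_col = 'code', to_col = 'comments')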


**Ex. 11.1.3**
Now that the code is removed, we can use BeautifulSoup to extract the text from the HTML.
> *Hint:* To extract the text from the Body column with BeautifulSoup, use `.get_text()`.

Check whether the results look good by sampling a few examples and skimming them. Wrap it all in a function and run it on all three datasets.

In [8]:
# [Answer to ex. 11.1.3]

In [9]:
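def get_text(html):
    # Parse the HTML and return only the text, with all tags stripped
    soup = BeautifulSoup(html, 'lxml')
    text = soup.get_text()
    return text

def extract_text(df):    
    # Store the cleaned text of each post body in a new `text` column
    df['text'] = df.Body.apply(get_text)    
    return df 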

overflow_df    = extract_text(overflow_df)
datascience_df = extract_text(datascience_df)
stats_df       = extract_text(stats_df)

**Ex.11.1.4** Finally, extract the individual tags from the Tags column using regular expressions. 
- Design a regular expression that matches the characters and symbols within the `<>` brackets (`<[symbolsgohere]>`), and assign the result to a column named `tags_l` that holds a list of tags. 
- Count the tags and visualize them using the WordCloud package (https://github.com/amueller/word_cloud ). See this simple example: https://github.com/amueller/word_cloud/blob/master/examples/simple.py 
> *Hint:* use the method `.generate_from_frequencies()`, which takes a dictionary of strings and their counts as input.

**Extra:** 
Visualize two tag sets:
- one that co-occurs with the `<r>` tag
- and another for the `<python>` tag.
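
The steps above can be sketched on a few made-up tag strings. The names `extract_tags`, `plot_tag_cloud` and the toy data are assumptions for illustration. The plotting function needs the `wordcloud` and `matplotlib` packages and is only defined here, not called.

```python
import re
from collections import Counter
from itertools import chain

# Capture whatever sits between '<' and '>' (anything except brackets).
tag_regex = re.compile(r'<([^<>]+)>')

def extract_tags(tag_string):
    """Return the list of tags in a string such as '<r><ggplot2>'."""
    return tag_regex.findall(tag_string)

# Toy Tags values (made up for illustration).
raw_tags = ['<r><ggplot2>', '<python><pandas>', '<r><ggplot2><legend>']
tags_l = [extract_tags(t) for t in raw_tags]

# Count how often each tag occurs across all posts.
tag_counts = Counter(chain.from_iterable(tags_l))

# Extra: count the tags that co-occur with 'r' (excluding 'r' itself).
r_cooc = Counter(t for tags in tags_l if 'r' in tags for t in tags if t != 'r')

def plot_tag_cloud(counts):
    """Draw a word cloud from a mapping of tag -> count."""
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    cloud = WordCloud(background_color='white').generate_from_frequencies(dict(counts))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
```

On the real data, `extract_tags` could be applied to the Tags column with `.apply`. Rows without tags would need handling first, for example by filling missing values with an empty string.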


In [10]:
# [Answer to ex. 11.1.4]

In [1]:
# [This question will be in assignment 4]

# Part 2: Filtering the data
All analyses fundamentally depend on the way that the data was sampled. In this case I constructed a "rough" dataset, using a greedy matching of all R and Python patterns in the posts and tags of the Stack Overflow forums. The matching of R in particular was problematic and needs another iteration. 

There are two strategies here:
- Work on a better regular expression that matches all instances of the R programming language and, if possible, excludes false positives. 
- Or use the Tags column to delimit the data, with the user-defined tags as qualifiers. Here the task is to locate and delimit which tags to include.
    
We will work on the latter strategy, but the choices made here could have a profound effect on our results and could be part of a sensitivity analysis.

A simple solution would be to keep only posts, and their "children", tagged with 
- `<r>`
- `<python>`

However, an initial analysis of the tagging behavior suggests that other programs and individual R packages also appear as tags.

To expand the tags for our sampling strategy, we can use the same methodology that we shall use later for finding phrases and collocations. 
Here we shall use the PMI (pointwise mutual information) measure of association to expand our seeds.

<br>

**Ex.11.2.1** Locate tags to be used for delimiting our population, using the PMI measure of association between all tags co-occurring with the `<r>` tag. 

- Use the `tags_l` column and count all co-occurring tags.
- *Hint:* You can use `itertools.combinations(tag_list, 2)` to iterate through co-occurring pairs and count them. 

Although we could use a networkx undirected graph as the data structure, this is overkill when we are not doing more advanced network analysis. Instead, use the built-in `collections.Counter` class and count the tuple pair `(tag, tag2)`. As we are not interested in the direction (although the sequence might tell you something), remember to sort the tags before counting them.
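
To see why sorting matters, here is a minimal sketch on made-up tag lists in which the same two tags appear in different orders. Without sorting, `('r', 'ggplot2')` and `('ggplot2', 'r')` end up as two separate keys.

```python
from collections import Counter
from itertools import combinations

# Toy tag lists (made up): the same tags in different order.
tag_lists = [['r', 'ggplot2'], ['ggplot2', 'r'], ['r', 'ggplot2', 'legend']]

unsorted_counts = Counter()
sorted_counts = Counter()
for tags in tag_lists:
    for pair in combinations(tags, 2):
        # Keeps the order in which the tags were listed on the post.
        unsorted_counts[pair] += 1
        # Sorting gives one canonical key per unordered pair.
        sorted_counts[tuple(sorted(pair))] += 1
```

Here the unsorted counter splits the pair of `r` and `ggplot2` over two keys, while the sorted counter collects all three occurrences under `('ggplot2', 'r')`.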

In [12]:
# [Answer to ex. 11.2.1]

In [13]:
def count_colocations(df):
    tag_collocations = Counter()

    for tags in df.tags_l:
        for pair in itertools.combinations(tags,2):
            # sort the pair
            s_pair = sorted(pair)
            tag_collocations[tuple(s_pair)] +=1
        
    return tag_collocations

coloc_overflow = count_colocations(overflow_df)
coloc_ds = count_colocations(datascience_df)
coloc_stats = count_colocations(stats_df)

**Ex 11.2.2** Define a function with 4 arguments `(token_count1, token_count2, cooc, N)`, where `N` is the number of tag occurrences in the corpus and `cooc` is the co-occurrence count. Use this input to compute the following formula, with $p(x, y)$ estimated as `cooc / N` and $p(x)$, $p(y)$ as `token_count1 / N` and `token_count2 / N`:
$$\operatorname{pmi}(x ; y) \equiv \log \frac{p(x, y)}{p(x) p(y)}$$
- Compute the PMI of all tag pairs with the `<r>` tag present.
- Inspect the top and the bottom to see how well the measure does.
- Define a variable `r_tags` as a `set()` of tags learned by inspecting the top 50 PMI-scored tags.   
- Finally, make it reusable by wrapping the whole thing in a function that takes a list of tokens as input and returns counters of individual tokens and of co-occurring tokens. 
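
A worked example with made-up counts may help when checking an implementation. The sketch below uses base-2 logarithms, which is a choice of convention, and the function name `pmi` and all numbers are assumptions for illustration.

```python
import math

def pmi(count_x, count_y, cooc, n):
    """Pointwise mutual information from raw counts, in bits (log base 2)."""
    p_xy = cooc / n      # estimated joint probability p(x, y)
    p_x = count_x / n    # estimated marginal probability p(x)
    p_y = count_y / n    # estimated marginal probability p(y)
    return math.log2(p_xy / (p_x * p_y))

# Made-up counts: n = 1000 tag occurrences, x occurs 100 times, y 50 times.
# If x and y were independent we would expect 1000 * 0.1 * 0.05 = 5 co-occurrences.
independent = pmi(100, 50, 5, 1000)    # ratio 1, so PMI is (numerically) 0
associated = pmi(100, 50, 25, 1000)    # ratio 5, so PMI is log2(5)
```

A PMI near zero means the two tags co-occur about as often as chance would predict, and a positive PMI means they co-occur more often. PMI tends to give high scores to rare tags, which is worth keeping in mind when inspecting the top of the ranking.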

In [14]:
# [Answer to ex. 11.2.2 here]

In [15]:
def get_pmi(token_count_1, token_count_2, cooc, N):
    # Estimate the joint and marginal probabilities from raw counts
    co_p = cooc / N
    p = token_count_1 / N
    p2 = token_count_2 / N
    # PMI with a base-2 logarithm
    return np.log2(co_p/(p*p2))


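def calculate_pmi_to_tag(df, tag = 'r'):
    colocation_count = count_colocations(df)
    # count_tags is not defined in this notebook; it is assumed to return
    # a Counter of individual tag occurrences over df.tags_l.
    overall_count = count_tags(df)
    
    # Keep only the pairs containing `tag`, and store the partner tag with the pair count.
    # Sorting on `x == tag` puts the partner (False) before `tag` (True).
    partners = [(sorted(pair,key=lambda x: x== tag )[0] ,count) for pair,count in colocation_count.items() if tag in pair]

    # Total number of tag occurrences in the corpus
    N_tags = sum(overall_count.values())
        
    # Number of occurrences of `tag`
    n = overall_count[tag]
    
    pmis = list()
    
    for tag2, co in partners:
        n2 = overall_count[tag2]
        pmi = get_pmi(n, n2, co, N_tags)
        pmis.append([pmi, n2, co])

    # One row per partner tag: PMI, the partner's total count, and the co-occurrence count
    collocations = pd.DataFrame(pmis,columns=['pmi','count','co_count'],index=[i for i,j in partners])
        
    return collocations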


r_collocations = calculate_pmi_to_tag(overflow_df, tag = 'r')

In [16]:
# Drop tags that only ever occur together with `r` (count equals co_count)
r_collocations = r_collocations[~(r_collocations['count']==r_collocations['co_count'])]
r_collocations.sort_values('pmi', ascending=False)

| tag | pmi | count | co_count |
|---|---|---|---|
| ggplot2 | 4.242703 | 145 | 143 |
| r-markdown | 4.137210 | 12 | 11 |
| knitr | 4.125237 | 11 | 10 |
| subset | 4.125237 | 22 | 20 |
| legend | 4.092816 | 9 | 8 |
| leaflet | 4.070096 | 8 | 7 |
| bar-chart | 4.070096 | 8 | 7 |
| summary | 4.040348 | 7 | 6 |
| confidence-interval | 4.040348 | 7 | 6 |
| data-analysis | 3.999706 | 6 | 5 |
