# Python for Researchers

## Day Four Objectives

*   learn and use Google Colab for writing Python
*   practice a text analysis workflow using data pulled from the journal *Language in Society*
*   review Pandas and dataframes
*   review cleaning and working with textual data
*   learn TF-IDF for preliminary analysis of a corpus



## Section One: Our Data

*Language in Society* is an international journal of sociolinguistics. Our dataset contains the metadata for journal articles, reviews, and other matter published from 1980 to 2018. The data was pulled directly from JSTOR using Constellate, Ithaka's dataset and analysis tool.

We are using this dataset because it has already been gathered, cleaned, and tested, and because it comes from a reliable source.

[Pull the dataset from here](https://drive.google.com/file/d/1qOBGbm7f8g3Hkb4deOAJmXkXrh2b7D37/view?usp=sharing)

### Create a dataframe 
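A minimal sketch of this step, assuming the dataset is a CSV file. The real file name is not given here, so a tiny in-memory CSV stands in for it, and the column names (`title`, `docSubType`, `publicationYear`, `doi`) and rows are illustrative assumptions rather than the actual contents of the download.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. In Colab you would upload the real CSV
# and pass its path to pd.read_csv instead. Columns and rows are made up.
sample_csv = io.StringIO(
    "title,docSubType,publicationYear,doi\n"
    "Language and gender in the classroom,research-article,1985,\n"
    "Review of Talking Voices,book-review,1990,\n"
    "Code-switching in bilingual communities,research-article,1990,\n"
)

# Read the CSV into a dataframe: one row per document, one column per field
df = pd.read_csv(sample_csv)

print(df.shape)  # (number of rows, number of columns)
```

With the real dataset, only the argument to `pd.read_csv` should need to change.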

### Explore the data
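One possible way to take a first look at a dataframe, sketched on a small made-up table. The column names are assumptions; swap in whatever the real dataset uses.

```python
import pandas as pd

# Toy stand-in for the journal metadata (assumed column names, made-up rows)
df = pd.DataFrame({
    "title": [
        "Language and gender in the classroom",
        "Review of Talking Voices",
        "Code-switching in bilingual communities",
    ],
    "docSubType": ["research-article", "book-review", "research-article"],
    "publicationYear": [1985, 1990, 1990],
})

print(df.head())  # the first few rows
df.info()         # column names, data types, and non-null counts

# How many documents of each type are there?
subtype_counts = df["docSubType"].value_counts()
print(subtype_counts)

# What span of years does the data cover?
year_range = (df["publicationYear"].min(), df["publicationYear"].max())
print(year_range)
```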

## Section Two: Cleaning Our Data

We'll start by dropping everything we don't need so we can focus on the columns that hold relevant information.

Let's say we want to explore our 'title' column and see which words and topics appear most often in titles over time.

First, we'll drop the columns that don't have any information in them.
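A sketch of the cleaning step on toy data, assuming that "no information" means every value in the column is missing. The `doi` column and the other column names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in: the 'doi' column is entirely empty (assumed column names)
df = pd.DataFrame({
    "title": [
        "Language and gender in the classroom",
        "Review of Talking Voices",
        "Code-switching in bilingual communities",
    ],
    "docSubType": ["research-article", "book-review", "research-article"],
    "publicationYear": [1985, 1990, 1990],
    "doi": [None, None, None],
})

# Drop every column in which all values are missing
df = df.dropna(axis="columns", how="all")

# Keep only the columns this analysis needs
titles = df[["title", "publicationYear"]].copy()
print(titles)
```

Using `how="all"` keeps columns that are only partly empty; `how="any"` would be much more aggressive.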

## TF-IDF Analysis

TF-IDF stands for Term Frequency-Inverse Document Frequency.

TF is basically a normalized word count: the number of times a word appears in a document divided by the total number of words in that document.

TF-IDF is the significance of a word normalized across an entire corpus: a word's 'importance' relative to how frequently it appears in an article and in the corpus as a whole. Words that show up in many articles are less important; those that show up in fewer articles are more important.
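To make the two definitions concrete, here is a small hand-rolled sketch using only the standard library. It uses the textbook form, TF times log(N / number of documents containing the word); libraries such as scikit-learn use a smoothed variant, so their numbers will differ. The three example "documents" are made up.

```python
import math
from collections import Counter

# Three made-up titles standing in for documents
docs = [
    "language and gender in society",
    "language change in urban speech",
    "politeness and power in courtroom talk",
]
corpus = [doc.split() for doc in docs]


def term_frequency(term, tokens):
    """Times the term appears in a document / number of words in that document."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    """log(number of documents / number of documents containing the term).

    Assumes the term appears in at least one document.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Score three words from the first document
for word in ["in", "language", "gender"]:
    print(word, round(tf_idf(word, corpus[0], corpus), 3))
```

The word "in" appears in every document, so its score drops to zero, while "gender" appears in only one document and scores highest of the three.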

A common example is to think about how you might choose a place to eat in a new city. Since you are in a new place, you not only want to go to a *good* place, you want to go to a place that is *unique*. Local reviews and recommendations might point you toward what is good, but you'd need an additional tool or step to find out which of those places offers unique cuisine.

In text analysis, some words might appear across an entire set of works. For example, in almost every Stephen King novel someone is wearing a blue chambray shirt. 'Chambray' would not be a *unique* word within King's works. However, if you were analyzing a corpus that included multiple horror authors, chambray would (probably) show up as a word that is *uniquely* King's.

We are going to analyze titles from our metadata and see if we can pull out any unique words. This is not necessarily ideal, since we'd want to work with a much larger corpus, but it will give an idea of the workflow.

## Your Turn 

Go back and select only the research articles, not the reviews, then run the analysis again. What changes?

First, use your Pandas knowledge and your knowledge of comparison operators (==, !=, etc.) to write code that drops all rows that do not contain 'research-article' in the 'docSubType' column.
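If you get stuck, the general pattern is a boolean mask built with a comparison operator. The sketch below uses made-up rows; only the 'docSubType' column name and the 'research-article' value come from the exercise.

```python
import pandas as pd

# Made-up rows standing in for the journal metadata
df = pd.DataFrame({
    "title": [
        "Language and gender in the classroom",
        "Review of Talking Voices",
        "Code-switching in bilingual communities",
    ],
    "docSubType": ["research-article", "book-review", "research-article"],
})

# The comparison yields True/False per row; indexing with it keeps the True rows
is_research = df["docSubType"] == "research-article"
research = df[is_research].copy()

print(research)
```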

Now select only the columns you need for the analysis, just like we did before.

Then run our four blocks of code to create the TF-IDF dataset.

Finally, let's show the most frequent words by year, just like we did before.

### LDA 

Our next type of text analysis is LDA, or Latent Dirichlet Allocation. LDA is used for topic modeling: taking a set of texts and generating (based on probability) the common topics in that set. LDA will then also show which topics are most associated with each document in the dataset.

How? First, LDA treats all of the text as a bag of words, meaning order and context are ignored. Only the word counts matter, along with how often words appear in the same document as other words.
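A quick illustration of the bag-of-words idea using only the standard library. The two "reviews" are made up.

```python
from collections import Counter

# Two made-up movie reviews
reviews = [
    "the ghost in the house",
    "the house in the woods",
]

# A bag of words is just a count of each word; order is thrown away
bags = [Counter(review.split()) for review in reviews]
print(bags[0])

# Shuffling the words gives exactly the same bag
shuffled = Counter("house the in ghost the".split())
print(shuffled == bags[0])

# Words that appear in both reviews
shared = set(bags[0]) & set(bags[1])
print(sorted(shared))
```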

For example, if we were working with movie reviews and generating topics from those reviews we might expect to see topics that look like: 
* ghost
* house
* woods

* new york
* love
* romance

* puppy
* couple
* children

Each group of words is one topic. These topics would apply to multiple reviews and show the common themes and genres of film represented in the dataset.

For our data, we are interested in seeing whether our titles hold any prevalent information about the kinds of topics that scholars submitting to the journal were discussing.

```python
# Notebook magic: install pyLDAvis into the current runtime
%pip install pyldavis
```

```python
import warnings

import nltk
# Uncomment on the first run to download the NLTK stop word list
# nltk.download('stopwords')
from nltk.corpus import stopwords

# Import the LDA items from scikit-learn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

import pyLDAvis
import pyLDAvis.lda_model

# Silence noisy library warnings so the notebook output stays readable
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

# Render pyLDAvis output inline in the notebook
pyLDAvis.enable_notebook()
```

#### pyLDAvis

There is a very nice visualization for LDA called pyLDAvis, but it can be a little tricky to interpret.

With no topic selected, the bar chart lists the most salient terms, and the blue bars show how frequently each term appears in the whole corpus.

If you hover over one of the circles, the bars will change. You'll see a new set of words and new red bars. The blue bars still show each term's frequency in the corpus, but the list now means a little more: these terms are not only frequent, they are the ones the model found most relevant to defining the topic.

In other words, the top words shown for a topic are the words the model grabbed onto the most to make that topic. The red bar shows the frequency of that word within the given topic.

The circles also give us some information. The size of a circle shows how much of our corpus (in our case, the titles) falls into that topic. The distribution of the circles shows the topics' relationships to one another: circles that are closer together are more related than circles that are farther apart.

## Your Turn

[Using the following dataset practice doing LDA.](https://drive.google.com/file/d/1gwn7x2ZDLOnwVrJp_WzQGQBLACRbpX0V/view?usp=sharing)

The data represents the full text of the articles pulled from the journal *Language in Society*, although it has been broken down into unigrams and then spliced back together in order to get accurate word counts for each article.

Use the 'text' column to perform the analysis in the same way we did with our 'title' column.