# [PACS-190] Tabular Text Analysis Playground


### Professor Matthew Specter

This notebook provides template code for starting text processing, exploration, and analysis on your own tabular data. If your data is formatted in a table contained in a `.csv`, Google Sheet, or Microsoft Excel file, this notebook may be appropriate.

---



**Dependencies:**

In [None]:
# Run this cell first to load the necessary libraries
!pip install textblob

import csv
import re
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfTransformer,
)
from textblob import TextBlob

%matplotlib inline
pd.set_option('display.max_colwidth', 280)
plt.style.use('fivethirtyeight')

# recent versions of scikit-learn expect stop words as a list
stop_words = list(ENGLISH_STOP_WORDS)

## Uploading Your Data to DataHub <a id='section 1'></a>

1. Make sure your data is saved to your computer as a `.csv` file. 
    - For Google Sheets, click the "File" menu, choose "Download As", and select "Comma-separated values (.csv)"
    - For Microsoft Excel, click the "File" menu and choose "Save As". In the dropdown menu below where you are asked to input the file name, choose "CSV UTF-8 (Comma delimited) (\*.csv)"
2. Go to the Jupyter dashboard at [datahub.berkeley.edu](https://datahub.berkeley.edu), where you see all your DataHub folders. If you're in a notebook, you can get there by clicking the Jupyter logo at the top left. Click on the PACS-190 folder, then the data folder.
3. Once you're in the data folder inside PACS-190, click the `upload` button at the top right, then upload your data file.

## Loading Your Data into the Notebook <a id='section 2'></a>

Once your file is uploaded to DataHub, replace the ellipsis in the cell below with the name of your file. Remember to **include the filename extension** (e.g. use "my_file.csv" instead of "my_file").

Run the cell to load your data and save it to the variable `data`.

In [None]:
# load your .csv file into a table
data = pd.read_csv("data/...")
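
After loading, it can help to confirm that the table has the rows and columns you expect. The sketch below uses a small made-up table read from an in-memory string instead of an uploaded file; the column names `id` and `text` and the example sentences are hypothetical stand-ins for your own data.

```python
import io

import pandas as pd

# Hypothetical stand-in for an uploaded .csv file; your own columns will differ.
example_csv = io.StringIO(
    "id,text\n"
    "1,The treaty was signed in 1919.\n"
    "2,Delegates debated the terms for months.\n"
)

example_data = pd.read_csv(example_csv)

# Number of rows and columns, then the column names and their data types.
n_rows, n_cols = example_data.shape
column_names = list(example_data.columns)
print(n_rows, n_cols)
print(column_names)
print(example_data.dtypes)
```

The same three checks (`.shape`, `.columns`, `.dtypes`) can be run on `data` once your own file is loaded.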

## Explore Your Data <a id='section 3'></a>

Use what you've learned in the module to explore your data. Some commonly helpful function calls have been provided below.

<div class="alert-info">
    **NOTE**: your data will likely need some cleaning or preprocessing to prep it for analysis. Since every data set and analysis is different, we cannot provide you a template for this step. If you have questions about how to preprocess your data, please drop in for [Peer Consulting](https://data.berkeley.edu/education/peer-consulting) or make an appointment at the [DLab](http://dlab.berkeley.edu/consulting).
</div> 

In [None]:
# show the first five rows of your table
data.head()

In [None]:
# get some summary statistics for your data
data.describe()

In [None]:
# replace the ellipsis with the name of the column that has the text data
# to get that column of texts
texts = data["..."]

In [None]:
# clean the texts as needed
# NOTE: cleaning is very specific to the types of documents you are working with
# Cleaning often involves regex
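
As one possible starting point, the sketch below shows a few common regex cleaning steps on made-up texts. The example strings, the `clean_text` helper, and the choice of steps (dropping URLs and punctuation, lowercasing, collapsing whitespace) are all illustrative assumptions; your own documents may need different steps entirely.

```python
import re

import pandas as pd

# Hypothetical example texts; replace with your own `texts` column.
example_texts = pd.Series([
    "Visit https://example.com NOW!!!",
    "  Too   many    spaces,\nand a line break. ",
    None,  # missing values are common in real tables
])


def clean_text(text):
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep only letters, digits, whitespace
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


# Fill missing entries with empty strings so every row is a string.
cleaned_texts = example_texts.fillna("").astype(str).apply(clean_text)
print(cleaned_texts.tolist())
```

Filling missing values first matters because the regex functions raise an error when given anything other than a string.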


In [None]:
# create a document-term matrix: one row per text, one column per word
cv = CountVectorizer(stop_words=list(stop_words))
dtm = cv.fit_transform(texts)
# convert raw counts to term frequencies (each row sums to 1)
tt = TfidfTransformer(norm='l1', use_idf=False)
dtm_tf = tt.fit_transform(dtm)
dtm_tf

In [None]:
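# get the total count of each word across all texts, most frequent first
word_counts = pd.Series(np.asarray(dtm.sum(axis=0)).ravel(), index=cv.get_feature_names_out())
word_counts.sort_values(ascending=False)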
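
Word counts can also be built directly from a list of tokens with `collections.Counter`. The sketch below uses two made-up sentences and a simple regex tokenizer; both are assumptions for illustration, and the `toy_` names are hypothetical.

```python
import re
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical example texts standing in for your own `texts` column.
toy_texts = ["The talks resumed after the treaty.", "The treaty ended the talks."]

# Split every text into lowercase word tokens.
toy_tokens = [
    token
    for text in toy_texts
    for token in re.findall(r"[a-z']+", text.lower())
]

# Drop common English words such as "the" before counting.
toy_tokens_no_stops = [t for t in toy_tokens if t not in ENGLISH_STOP_WORDS]

toy_counts = Counter(toy_tokens_no_stops)
print(toy_counts.most_common(3))
```

`most_common(n)` returns the `n` most frequent words with their counts; calling it with no argument returns every word.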

# Resources

- [The Berkeley DLab](http://dlab.berkeley.edu/): introductory coding and data analysis workshops, plus individual data science consulting conducted by grad students and staff. Most services are by appointment or preregistration only. Located in Barrows Hall.
- [Peer Consulting](https://data.berkeley.edu/education/peer-consulting): drop-in office hours for introductory-level data science help, including debugging and concepts. First floor Moffitt Library.

---
Notebook developed by: Keeley Takimoto (@ktakimoto on GitHub)

Data Science Modules: http://data.berkeley.edu/education/modules
