# [PACS-190] Text Analysis Playground


### Professor Matthew Specter

This notebook provides template code for starting text processing, exploration, and analysis on your own text data. It is likely a good fit if your text is stored in a `.txt` file, a Google Doc, or a Microsoft Word file.

---



**Dependencies:**

In [None]:
# Run this cell first to load the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import re
from string import punctuation
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from collections import Counter

## Uploading Your Data to DataHub <a id='section 1'></a>

1. Make sure your data is saved to your computer as a `.txt` file. 
    - For Google Docs, click the "File" menu, choose "Download As", and select "Plain Text (.txt)"
    - For Microsoft Word, click the "File" menu and choose "Save As". In the dropdown menu below where you are asked to input the file name, choose "Plain Text (\*.txt)"
2. Go to the Jupyter dashboard at [datahub.berkeley.edu](https://datahub.berkeley.edu), the page that lists all your DataHub folders. If you're in a notebook, you can get there by clicking the Jupyter logo at the top left. Click on the PACS-190 folder, then the data folder.
3. Once you're in the PACS-190/data folder, click the `upload` button at the top right, then upload your data file.

## Loading Your Data into the Notebook <a id='section 2'></a>

Once your file is uploaded to DataHub, replace the ellipsis (`...`) in the cell below with the name of your file. Remember to **include the filename extension** (e.g. use "my_file.txt" instead of "my_file").

Run the cell to load your data and save it to the variable `data`.

In [None]:
# load a text file
with open("data/...", "r") as f:
    data = f.read()
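
Files exported from Word or Google Docs sometimes contain characters that fail to decode under the system's default encoding. The sketch below shows one way to read a file with an explicit encoding and then preview the result. It assumes the file is UTF-8 encoded. The helper name `load_text` and the throwaway file `example.txt` are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def load_text(path, encoding="utf-8"):
    """Read a whole text file into one string (assumes UTF-8 by default)."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()

# Write a throwaway demo file so the sketch runs on its own
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "example.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("Peace and conflict studies.\nSecond line.")

demo_data = load_text(demo_path)
print(demo_data[:27])                # preview the first 27 characters
print(len(demo_data.splitlines()))   # number of lines in the file
```

Previewing a slice such as `data[:500]` is a quick way to confirm that the right file loaded before you run the rest of the notebook.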

## Explore Your Data <a id='section 3'></a>

Use what you've learned in the module to explore your data. Some commonly helpful function calls have been provided below.

<div class="alert-info">
    **NOTE**: your data will likely need some cleaning or preprocessing to prepare it for analysis. Since every data set and analysis is different, we cannot provide you with a template for this step. If you have questions about how to preprocess your data, please drop in for [Peer Consulting](https://data.berkeley.edu/education/peer-consulting) or make an appointment at the [DLab](http://dlab.berkeley.edu/consulting).
</div> 
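
As one illustration only, and not a template for every data set, the sketch below handles two issues that often appear in text exported from Word or Google Docs: typographic ("curly") quotes and irregular whitespace. The function name `basic_clean` and the sample string are hypothetical. Your own data may need different steps.

```python
import re

def basic_clean(text):
    """Illustrative cleanup: normalize curly quotes and collapse whitespace."""
    # Map typographic quotes to their plain ASCII equivalents
    replacements = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
    for fancy, plain in replacements.items():
        text = text.replace(fancy, plain)
    # Drop a leading byte-order mark if one is present
    text = text.lstrip("\ufeff")
    # Collapse runs of spaces, tabs, and newlines into a single space
    return re.sub(r"\s+", " ", text).strip()

sample = "\ufeff\u201cPeace\u201d   isn\u2019t\n\nsimple."
cleaned = basic_clean(sample)
print(cleaned)
```

Normalizing curly quotes matters here because Python's `string.punctuation` contains only ASCII punctuation, so curly quotes would otherwise survive punctuation removal.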

In [None]:
# get the number of characters in your data
len(data)

In [None]:
# define a function to tokenize your text
def tokenize(text):
    no_punc = ''.join([c for c in text if c not in punctuation])
    no_punc_lower = no_punc.lower()
    no_punc_lower_tokens = no_punc_lower.split()
    return no_punc_lower_tokens
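
To see what this tokenizer does in practice, here is a small usage sketch on a made-up sentence. The function is repeated in condensed form so the example runs on its own, and the sample sentence is hypothetical.

```python
from string import punctuation

def tokenize(text):
    # Same logic as the cell above, condensed so this example is self-contained
    no_punc = ''.join([c for c in text if c not in punctuation])
    return no_punc.lower().split()

sample = "Don't panic: the well-known treaty was signed."
sample_tokens = tokenize(sample)
print(sample_tokens)
# ['dont', 'panic', 'the', 'wellknown', 'treaty', 'was', 'signed']
```

Because punctuation is deleted rather than replaced with a space, contractions and hyphenated words are merged ("don't" becomes "dont", "well-known" becomes "wellknown"). Keep this in mind when interpreting word counts.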

In [None]:
# get tokens for your text
tokens = tokenize(data)
tokens

In [None]:
# remove stop words
tokens_no_stops = []

for t in tokens:
    if t not in ENGLISH_STOP_WORDS:
        tokens_no_stops.append(t)

tokens_no_stops
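
Depending on your text, you may want to filter out extra words beyond the default list. `ENGLISH_STOP_WORDS` is a `frozenset`, so it can be combined with your own words using the set union operator. The sketch below uses a small stand-in set so it runs without scikit-learn. The extra words and the example tokens are hypothetical.

```python
# Stand-in for sklearn's ENGLISH_STOP_WORDS so this sketch runs on its own;
# in the notebook you would use ENGLISH_STOP_WORDS here instead
base_stop_words = frozenset({"the", "a", "of", "and", "was"})

# Hypothetical domain-specific words to filter out as well
custom_stop_words = base_stop_words | {"said", "also"}

example_tokens = ["the", "treaty", "was", "signed", "and", "also", "ratified"]

# List comprehension equivalent of the for-loop above
filtered = [t for t in example_tokens if t not in custom_stop_words]
print(filtered)
# ['treaty', 'signed', 'ratified']
```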

In [None]:
# get some word counts
Counter(tokens_no_stops).most_common()
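
Calling `most_common()` with no argument returns every word, which can be a very long list. A minimal sketch of limiting the output to the top few words and converting counts to shares of the total, using hypothetical example tokens:

```python
from collections import Counter

example_tokens = ["peace", "war", "peace", "treaty", "peace", "war"]
counts = Counter(example_tokens)

# Passing a number returns only that many of the most frequent words
top_two = counts.most_common(2)
print(top_two)
# [('peace', 3), ('war', 2)]

# Share of all tokens accounted for by each top word
total = sum(counts.values())
shares = {word: n / total for word, n in top_two}
print(shares["peace"])
# 0.5
```

Relative shares are easier to compare across documents of different lengths than raw counts.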

# Resources

- [The Berkeley DLab](http://dlab.berkeley.edu/): introductory coding and data analysis workshops, plus individual data science consulting conducted by grad students and staff. Most services are by appointment or preregistration only. Located in Barrows Hall.
- [Peer Consulting](https://data.berkeley.edu/education/peer-consulting): drop-in office hours for introductory-level data science help, including debugging and concepts. Located on the first floor of Moffitt Library.

---
Notebook developed by: Keeley Takimoto (@ktakimoto on GitHub)

Data Science Modules: http://data.berkeley.edu/education/modules
