# Day 2 -- Python for Researchers

## Today's Goals:
   * Learn how to open and read .txt files
   * Explore Python libraries, how to use them, and how to read documentation
   * Do some basic troubleshooting using Google searches, documentation, and AI queries
   * Learn the basics of data cleaning for text files
   * Introduce some options for visualizing data through a matplotlib wordcloud

**Section 1: Finding and opening .txt files**

The easiest way to work with files while programming is to make sure each file is in your *working directory*. Your working directory
is the folder you are currently coding in. VSCode always pushes you to be coding inside a folder.

You have a file named *ascii-text-art.txt* in your working directory. Below is the code you need to
open the file and read it using Python!

You will need the path to your file. Right-click the file and you will see two options; both should work:
   * copy path (full file path)
   * copy relative path (relative, or shortened, file path)

Below we are using three functions:
   * open() -- built-in Python function, opens a file and returns a file object
   * read() -- method of the file object that open() returns, gives back the file's contents as a string
   * print() -- built-in Python function, displays output below the cell for user convenience



In [None]:
#ASCII File

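#the r before the quotes marks a "raw" string, so backslashes in a Windows path are kept as-is
#the "r" passed to open() below is different: it stands for "read" mode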
path = r"YOUR PATH HERE"

with open(path, "r", encoding="utf-8") as f:
    content = f.read()
    print(content)
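
If the path is wrong, open() raises a FileNotFoundError. The sketch below shows one possible way to make that failure easier to read. The helper name `read_text_file` and the file names are made up for illustration; they are not part of the workshop files.

```python
import os
import tempfile


def read_text_file(path):
    """Return the file's text, or None if the file does not exist."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # Show where Python is looking, which is usually the cause of the problem.
        print(f"Could not find {path!r}. Working directory: {os.getcwd()}")
        return None


# Demonstration with a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")  # hypothetical file name
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("hello, researchers")

    found = read_text_file(sample_path)
    missing = read_text_file(os.path.join(folder, "not_here.txt"))

print(found)
print(missing)
```

When the file is missing, the message includes the working directory, which is often enough to spot a file saved in the wrong folder.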

If you are ever unsure where exactly you are coding, you can use the getcwd() function from the os module to show you!
getcwd means "get current working directory" -- https://www.w3schools.com/python/ref_os_getcwd.asp

It returns the file path of wherever you are coding.

In [None]:
import os
os.getcwd()
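
To see how a relative path relates to the working directory, here is a minimal sketch using the standard pathlib module. It assumes nothing about whether the file is actually present; it only shows where Python would look for it.

```python
import os
from pathlib import Path

relative = Path("ascii-text-art.txt")

# A relative path is interpreted starting from the working directory.
absolute = Path.cwd() / relative

print("Working directory:", os.getcwd())
print("Relative path:", relative)
print("Absolute path:", absolute)
print("File found there?", absolute.exists())
```

If the last line prints False, the file is not in the folder you are currently coding in.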

**Section 2: Doing some basic text cleaning**

We're now going to practice some basic text cleaning using a new .txt file. For this one, you need to download it yourself
and add it to your working directory. You can find the file here: https://drive.google.com/file/d/1Xoenw8hs84nBpkcETqJOC_MBkopEjamS/view?usp=sharing

First, you need to open the file, and then we'll get started:

In [None]:
path = r"bush_2002_sotu.txt"

with open(path, "r", encoding="utf-8") as f:
    bush_file = f.read()

We are going to be looking at the State of the Union (SOTU) address by Pres. George W. Bush from 2002. We've got one basic research question: 
*what are the most commonly referenced topics in Pres. Bush's SOTU address?* 

In order for us to begin to answer that question, we have to clean our dataset.

The most common thing we need to do is remove *stopwords*. These are words (such as "the", "and", and "is") that are so commonly used that they carry little meaning for
textual analysis and waste processing time to evaluate. You can create your own custom list of stopwords, aka words that you want to ignore in your dataset, but
most Python libraries for text analysis also include a pre-set list of stopwords.

Original SOTU file from: https://georgewbush-whitehouse.archives.gov/news/releases/2002/01/20020129-11.html

In [None]:
#example of what stopwords are

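import nltk
from nltk.corpus import stopwords

#only needed the first time: downloads the stopword lists to your computer
nltk.download("stopwords")

stops = stopwords.words("english")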

stops
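
A custom stopword list works the same way as a pre-set one. The sketch below uses a small hand-written set and a made-up list of words, purely for illustration, and needs no extra libraries.

```python
# A small, hand-written stopword list (hypothetical; choose words that suit your own data).
custom_stops = {"the", "a", "of", "and", "to", "is"}

# Made-up sample words standing in for a real text.
sample_words = ["the", "state", "of", "the", "union", "is", "strong"]

# Keep only the words that are not in the stopword list.
kept = [word for word in sample_words if word not in custom_stops]
print(kept)  # ['state', 'union', 'strong']
```

To extend a pre-set list instead of replacing it, you could combine the two, for example `set(stops) | custom_stops`.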

Because we have this list of stopwords in the nltk library, we are able to easily clean our data
using a series of methods from the nltk library, some built-in Python functions, and a for loop.

Below are the functions/methods that we use:
   *  set() -- converts a list (or other collection) into a set. This step is not strictly necessary, but checking whether a
       word is in a set is much faster than checking a list because of the way sets are built in Python.
   *  word_tokenize() -- splits our dataset into tokens, aka smaller pieces. Tokenization is commonly used to turn raw text
       into something a programming language can actually use. Here it breaks the sentences into a list of individual words & punctuation.
       You can see the results below on line 17 when we print the word_tokens variable
   *  lower() -- converts everything into lowercase. This is very common when handling strings because Python treats upper and
       lower case letters as different characters ("The" and "the" are not equal)
   *  isalpha() -- checks whether a string contains only letters, returns a Boolean (False or True)
   *  append() -- adds an item to the end of a list
   *  print() -- displays values of our choosing; an extremely helpful tool for debugging, but it does not change the program's data

We will walk through some of these functions together and learn more about how to use them and understand what mandatory parameters they require. 
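
Before running these on the full speech, it can help to try the string and list methods on tiny inputs. The values below are made up for illustration.

```python
word = "America"

# Comparisons are case-sensitive, so lowercase the word before comparing.
print(word == "america")          # False
print(word.lower() == "america")  # True

# isalpha() is True only when every character is a letter.
print("freedom".isalpha())     # True
print("2002".isalpha())        # False
print(",".isalpha())           # False
print("well-being".isalpha())  # False, because of the hyphen

# Membership checks read the same for sets and lists; sets are just faster.
tiny_stop_words = set(["the", "and", "of"])  # made-up list for illustration
print("the" in tiny_stop_words)      # True
print("freedom" in tiny_stop_words)  # False

# append() adds one item to the end of a list.
kept_words = []
kept_words.append(word.lower())
print(kept_words)  # ['america']
```

Note that isalpha() also rejects hyphenated words, which is one of the trade-offs of this simple cleaning approach.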

In [None]:
#cleaning speech of stop words & punctuation
#adapted code from https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
#optional step, converting into a set for faster processing
stop_words = set(stopwords.words('english'))
 
word_tokens = word_tokenize(bush_file)
cleaned_speech = []
 
for w in word_tokens:
    if w.lower() not in stop_words and w.isalpha():
        cleaned_speech.append(w.lower())
 
print(word_tokens)
print(cleaned_speech)
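
One possible next step toward the research question is to count how often each cleaned word appears. This is a minimal sketch using Counter from the standard library; the `sample_cleaned` list is a made-up stand-in, not the real speech, and you could pass `cleaned_speech` instead.

```python
from collections import Counter

# Stand-in for the cleaned_speech list (made-up words, not the real speech).
sample_cleaned = ["freedom", "security", "freedom", "jobs", "security", "freedom"]

# Counter tallies how many times each word occurs.
word_counts = Counter(sample_cleaned)

# most_common(n) returns the n most frequent words with their counts.
top_two = word_counts.most_common(2)
print(top_two)  # [('freedom', 3), ('security', 2)]
```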

**Section 3: Building a WordCloud**

We can now build the word cloud using the wordcloud and matplotlib libraries.

But uh-oh! Depending on the versions you have installed, the wordcloud library may fail because it relies on a deprecated feature of
another library (note the commented-out Pillow lines in the code cell below). This is a really common problem you can face when using libraries, and
it's important to understand how to troubleshoot it. We'll walk through a few options together.

Documentation:
   * Wordcloud: https://amueller.github.io/word_cloud/
   * matplotlib: https://matplotlib.org/
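
When a library raises an unexpected error, a useful first troubleshooting step is to check which versions are installed, since error reports and documentation are usually version-specific. The sketch below uses the standard importlib.metadata module (Python 3.8 or newer); the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names as they are installed with pip.
for name in ["wordcloud", "pillow", "matplotlib"]:
    print(name, installed_version(name))
```

Including these version numbers in a Google search or AI query tends to produce more relevant answers.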



Once we've figured that out, your next goal will be to work out on your own how to
save the wordcloud as a .png or .jpg file.

Lastly, we only did very minimal data cleaning, so our wordcloud can be significantly improved. What are some limitations that you can spot? 

In [None]:
from wordcloud import WordCloud
from matplotlib import pyplot as plt
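# possible workaround if wordcloud fails with a Pillow-related error:
# uncomment the two lines below to install an older Pillow version, then restart the kernel
# !pip uninstall pillow -y
# !pip install pillow==9.5.0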

text_to_plot = " ".join(cleaned_speech)

# create a WordCloud 
wordcloud = WordCloud(width=1800, height=1500, 
                      background_color="white", 
                      min_font_size=10).generate(text_to_plot)

# plot the WordCloud image
plt.figure(figsize=(5, 5), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

#your code -- how can we save a wordcloud as an image using Python? 

plt.show()

**Section 4: Now you try!**

Your goal is to adapt the code above to create your own wordcloud. You will need to:
  * find a .txt file online (or perhaps you already have one you can use)
  * bring that .txt file into your working directory
  * do some very basic data cleaning
  * put it into a wordcloud!

If you finish those steps, try reading the documentation on how to mask a wordcloud
with an image and see if you can get your code working: https://amueller.github.io/word_cloud/auto_examples/masked.html#sphx-glr-auto-examples-masked-py