<a href="https://colab.research.google.com/github/cbroker1/text-as-data/blob/master/TAD_Week_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Text as Data (Summer 2020)!

Below we will look at how to get set up with Quanteda. The process is very similar whether you are using Google Colab, RStudio, or another development environment. Depending on your OS (Windows, macOS, Ubuntu, etc.), you may have to install other packages or software. Reach out immediately if you have problems. Sometimes setup issues are the hardest part of analysis!

### Step 1. Let's install Quanteda.

In [0]:
# This particular step only applies to Linux and Colab.
# Colab notebooks run on a Google flavor of Linux. Some R packages
# require OS-level packages to be installed first.
# -y answers apt-get's confirmation prompt, which a notebook cell cannot do interactively.
system("apt-get install -y libpoppler-cpp-dev")

In [0]:
install.packages('quanteda')
install.packages('readtext')
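
Re-running `install.packages()` in every session is slow. As a minimal sketch, you could first check which packages are missing and install only those. The helper name `missing_packages` is hypothetical; it is not part of Quanteda or base R.

```r
# Hypothetical helper: returns the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (instead of erroring) when a package is absent.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("quanteda", "readtext"))
print(to_install)

# Uncomment to install whatever is missing:
# install.packages(to_install)
```

If `to_install` prints `character(0)`, both packages are already available and you can move on to Step 2.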

### Step 2. Now that we have Quanteda installed, we can read in some data.



In [0]:
library(quanteda)
library(readtext)

In [0]:
# Read in the CSV using readtext. Straightforward.
nuclear_data = readtext('/content/sentiment_nuclear_power (1).csv', text_field = 'tweet_text')
readtext_corpus = corpus(nuclear_data)


# Read in the CSV using base R's read.csv.
# We will need to set a few options and add a document ID field.
# See https://quanteda.io/reference/corpus.html for additional info.
nuclear_data_csv = read.csv('/content/sentiment_nuclear_power (1).csv',
                            header=TRUE, # We have headers so set to TRUE
                            stringsAsFactors = FALSE) # Make sure text is retained as char
# Quanteda identifies each document by name. For a CSV or similar format
# we will assign integers as document IDs.
nuclear_data_csv$doc_id = seq.int(nrow(nuclear_data_csv))
# Create a corpus. Use 'text_field' to identify the right text column
csv_corpus = corpus(nuclear_data_csv, text_field = 'tweet_text')
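
To see what the `read.csv` options and the `doc_id` assignment do without needing the course file, here is a small base-R sketch. The two rows are invented for illustration and are not taken from the nuclear power data; the name `toy` is likewise an assumption.

```r
# Invented two-row CSV, supplied inline through the `text` argument.
csv_text <- 'tweet_text,sentiment
"Nuclear power is risky.",Negative
"A new plant opened today.",Neutral'

toy <- read.csv(text = csv_text,
                header = TRUE,             # first line holds the column names
                stringsAsFactors = FALSE)  # keep text columns as character

# Integer document IDs, one per row.
toy$doc_id <- seq.int(nrow(toy))

str(toy)
```

`str(toy)` should show `tweet_text` as a character column and `doc_id` as an integer column, which is the shape `corpus()` expects when you pass `text_field = 'tweet_text'`.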


### Step 3. Review the corpus.


In [0]:
# Let's look at the summary of our corpus.
# Pick either CSV or readtext.
summary(readtext_corpus)

In [0]:
# Find and view some text in our corpus.
# as.character() replaces texts(), which newer quanteda releases deprecate.
as.character(readtext_corpus)[2]

In [0]:
# Check out the metadata for a specific document
summary(readtext_corpus[5])

| | Text | Types | Tokens | Sentences | sentiment | sentiment_confidence_summary |
|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` |
| 1 | sentiment_nuclear_power (1).csv.5 | 26 | 32 | 1 | Neutral / author is just sharing information | ""Neutral / author is just sharing information"": 1.0 |


In [0]:
# Look at only negative sentiment tweets.
# Use corpus_subset to select only tweets with negative sentiment.
corpus_subset(readtext_corpus, sentiment == 'Negative')
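
The same filtering can be tried on a toy corpus if you want to check the result by eye. This is a sketch under assumptions: the three texts and their labels are invented, and `toy_corpus` and `negative_only` are illustrative names.

```r
library(quanteda)

# Invented texts with a document-level `sentiment` variable.
toy_corpus <- corpus(
  c(d1 = "Nuclear power is risky.",
    d2 = "A new plant opened today.",
    d3 = "Shut the reactors down."),
  docvars = data.frame(sentiment = c("Negative", "Neutral", "Negative"))
)

# The condition is evaluated against the docvars, one value per document.
negative_only <- corpus_subset(toy_corpus, sentiment == "Negative")

ndoc(negative_only)      # number of documents kept
docnames(negative_only)  # which documents they are
```

Only `d1` and `d3` carry the `Negative` label, so those are the two documents that remain.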

In [0]:
# We can add additional metadata to our corpus
docvars(readtext_corpus, 'course') = 'Text as Data'
# A single value like the one above is recycled across every document.
# Assigning a vector with one value per document would give each document its own value.

# Look at only the first five rows of the summary, with our new metadata/docvar
summary(readtext_corpus, 5)

| | Text | Types | Tokens | Sentences | sentiment | sentiment_confidence_summary | course |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | sentiment_nuclear_power (1).csv.1 | 13 | 13 | 1 | Negative | ""Neutral / author is just sharing information"": 0.2 ""Negative"": 0.8 | Text as Data |
| 2 | sentiment_nuclear_power (1).csv.2 | 15 | 15 | 2 | Neutral / author is just sharing information | ""Neutral / author is just sharing information"": 1.0 | Text as Data |
| 3 | sentiment_nuclear_power (1).csv.3 | 18 | 18 | 3 | Neutral / author is just sharing information | ""Neutral / author is just sharing information"": 0.667 ""Negative"": 0.333 | Text as Data |
| 4 | sentiment_nuclear_power (1).csv.4 | 25 | 29 | 1 | Neutral / author is just sharing information | ""Neutral / author is just sharing information"": 1.0 | Text as Data |
| 5 | sentiment_nuclear_power (1).csv.5 | 26 | 32 | 1 | Neutral / author is just sharing information | ""Neutral / author is just sharing information"": 1.0 | Text as Data |
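
To contrast a recycled value with a per-document value, here is a short sketch on a toy corpus. The texts are invented, and the `week` variable is a hypothetical docvar added only for illustration.

```r
library(quanteda)

toy_corpus <- corpus(c(d1 = "Nuclear power is risky.",
                       d2 = "A new plant opened today."))

# One value: recycled, so every document gets the same course name.
docvars(toy_corpus, "course") <- "Text as Data"

# One value per document: each document gets its own entry.
docvars(toy_corpus, "week") <- c(1, 2)

docvars(toy_corpus)
```

The printed data frame should have one row per document, with `course` identical across rows and `week` differing.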


### Step 4. Create the DFM.

We will take a closer look at the document-feature matrix (DFM) next week. This week we will use it to generate a word cloud.

In [0]:
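# Tokenize first, then build the DFM. Newer quanteda releases deprecate
# passing a corpus directly to dfm().
dfm_nuclear = dfm(tokens(readtext_corpus))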

In [0]:
dfm_nuclear

Document-feature matrix of: 190 documents, 1,220 features (98.4% sparse) and 3 docvars.
                                   features
docs                                : hello japan is a nuclear power plant
  sentiment_nuclear_power (1).csv.1 1     1     1  1 1       1     1     1
  sentiment_nuclear_power (1).csv.2 0     0     0  1 0       2     0     0
  sentiment_nuclear_power (1).csv.3 0     0     0  0 0       1     1     0
  sentiment_nuclear_power (1).csv.4 1     0     0  0 0       1     1     1
  sentiment_nuclear_power (1).csv.5 2     0     0  0 0       1     0     0
  sentiment_nuclear_power (1).csv.6 0     0     0  0 0       1     1     0
                                   features
docs                                crisis .
  sentiment_nuclear_power (1).csv.1      1 1
  sentiment_nuclear_power (1).csv.2      0 1
  sentiment_nuclear_power (1).csv.3      0 1
  sentiment_nuclear_power (1).csv.4      0 1
  sentiment_nuclear_power (1).csv.5      0 3
  [ output truncated ]

In [0]:
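# A tiny three-document example makes the structure of a DFM easier to see.
# The names of the vector become the document names.
tes = c(Mao = "Political power grows out the barrel of a gun.",
        Kanye = "No one man should have all that power.",
        Saying = "The quick brown fox jumped over the lazy dog.")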

In [0]:
tec = corpus(tes)

In [0]:
# Tokenize, then build the DFM.
dfm(tokens(tec))

Document-feature matrix of: 3 documents, 24 features (61.1% sparse).
        features
docs     political power grows out the barrel of a gun .
  Mao            1     1     1   1   1      1  1 1   1 1
  Kanye          0     1     0   0   0      0  0 0   0 1
  Saying         0     0     0   0   2      0  0 0   0 1
[ reached max_nfeat ... 14 more features ]
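
The numbers in that header (3 documents, 24 features, 61.1% sparse) can be reproduced by hand. The sketch below builds the same kind of matrix in base R. The tokenizer is an assumption (a simple regular expression that lowercases and splits off punctuation), not Quanteda's own, and `toy_dfm` is an illustrative name.

```r
docs <- c(Mao = "Political power grows out the barrel of a gun.",
          Kanye = "No one man should have all that power.",
          Saying = "The quick brown fox jumped over the lazy dog.")

# Assumed tokenizer: lowercase, then keep words and single punctuation marks.
lower <- tolower(docs)
toks <- regmatches(lower, gregexpr("[[:alnum:]]+|[[:punct:]]", lower))
names(toks) <- names(docs)

# Features are the unique tokens, in order of first appearance.
features <- unique(unlist(toks))

# One row per document, one column per feature; each cell is a count.
toy_dfm <- t(vapply(toks,
                    function(tk) tabulate(match(tk, features),
                                          nbins = length(features)),
                    integer(length(features))))
colnames(toy_dfm) <- features

dim(toy_dfm)

# Sparsity is the share of cells that are zero.
sparsity <- mean(toy_dfm == 0)
round(100 * sparsity, 1)
```

Of the 72 cells (3 documents by 24 features), 44 are zero, which gives the 61.1% reported above. The `the` column for `Saying` holds 2 because that sentence uses the word twice.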