# Analysis 2
## Verbal Corpus of the Dakota Access Pipeline
This is the second of three analyses of corpus-linguistic data related to ecological themes. The corpus consists of articles and webpages about the Dakota Access Pipeline (DAP). The data were collected manually using a search engine, and each result was saved as a separate txt file.

Link to raw data (on GitHub):

[Analysis 2](https://github.com/craigmateo/multilevel_corpus/tree/master/Analysis_2)

### Pre-processing

The code below reads all the *txt* files and preprocesses the text. The preprocessing consists of:

1. **Noise removal** (removal of punctuation, special characters, digits)
2. **Normalization** (lowercasing, lemmatization, removal of stopwords)

An excerpt from the preprocessed corpus is then printed.
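As a rough illustration of the two steps on a single sentence, the sketch below uses only the standard library. The sentence and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's English stopword list and WordNet lemmatizer), and lemmatization is left out here.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"the", "of", "in", "a", "to", "was"}

def remove_noise(text):
    """Lowercase and collapse runs of digits/non-word characters to one space."""
    return re.sub(r"(\d|\W)+", " ", text.lower()).strip()

def drop_stopwords(text, stopwords=TOY_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)

sentence = "In April 2016, the camp was established near the Missouri River!"
cleaned = remove_noise(sentence)
normalized = drop_stopwords(cleaned)
print(cleaned)
print(normalized)
```

The first printed line corresponds to the "noise removal" stage and the second to noise removal plus stopword filtering.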

In [2]:
import glob

import pandas as pd

# Collect every .txt file in the corpus folder (the path is specific to the author's machine)
txt_files = glob.glob("C:\\Users\\Craig\\Documents\\GitHub\\multilevel_corpus\\Analysis_2\\corpus\\*.txt")
raw_lines = []

#txt_files = glob.glob("*.txt")
for filename in txt_files:
    with open(filename, "r", encoding="utf-8") as f:
        # Keep every line (with its trailing newline) as one corpus entry
        raw_lines.extend(f.readlines())

# Libraries for text preprocessing

import re
import nltk

#nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

#nltk.download('wordnet') 

from nltk.stem.wordnet import WordNetLemmatizer

##Creating a list of stop words

stop_words = set(stopwords.words("english"))

corpus_PRE = []
corpus = []

# Instantiate the lemmatizer once rather than on every iteration
lem = WordNetLemmatizer()

for line in raw_lines:

    #Remove punctuation
    #text = re.sub('[^a-zA-Z]', ' ', line)

    #remove tags
    text = re.sub("</?.*?>", " <> ", line)

    #Convert to lowercase
    text = text.lower()

    # remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)

    # Noise-removed text (before normalization), used later for the word count
    corpus_PRE.append(text)

    ##Convert to list from string
    tokens = text.split()

    #Lemmatisation, skipping stopwords
    tokens = [lem.lemmatize(word) for word in tokens if word not in stop_words]
    corpus.append(" ".join(tokens))

text = " ".join(corpus)

print("Excerpt:" + "\n\n" + text[1100:1200])

Excerpt:

ric preservation asked army corp engineer conduct formal environmental impact assessment issue envir


### Word Count
One count is taken with only noise removal and another with both noise removal and normalization. 

In [3]:
print("\n\t\t\t" + '****** Word Count ******')  

textPre = " ".join(corpus_PRE)

num_words_PRE = format(len(textPre.split()),",")

num_words = format(len(text.split()),",")

print("\n" + 'Noise Removal:' + "\n")  

print(str(num_words_PRE))

print("\n" + 'Noise Removal & Normalization:' + "\n")  
print(str(num_words))



			****** Word Count ******

Noise Removal:

291,856

Noise Removal & Normalization:

174,974
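The two counts can be compared directly. A small sketch, using the figures printed above, of how many tokens are dropped during normalization (the variable names are illustrative):

```python
# Figures taken from the word-count output above
noise_removed = 291_856
normalized = 174_974

dropped = noise_removed - normalized
share = dropped / noise_removed * 100
print(f"{dropped:,} tokens dropped ({share:.1f}% of the noise-removed count)")
```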


## Quotations
Quotations are extracted from the corpus using regular expression matching. The 500 characters before and the 100 characters after each quotation are also extracted so that, for each quote, the context and the speaker can be identified.

In [4]:
quotesList = []
quoteContext = []

alltext = " ".join(raw_lines)

# finditer gives the position of every match, so a repeated quotation
# gets its own context rather than that of its first occurrence
for m in re.finditer(r'"(.*?)"', alltext):
    quote = m.group(1)
    start = max(0, m.start(1) - 500)   # clamp so early quotes do not wrap around
    end = m.end(1) + 100
    context = alltext[start:end].replace("\n", "")
    quotesList.append(quote)
    quoteContext.append([quote, context])

total = len(quotesList)

print("\n" + 'Total Quotations:' + "\n") 
print(total)

print("\n" + 'Sample of Quotation (with context below):' + "\n") 


print('"' + quotesList[0] + '"' +"\n")
print(quoteContext[0][1])


Total Quotations:

1101

Sample of Quotation (with context below):

"reshaping the national conversation for any environmental project that would cross the Native American land."

ny in the Standing Rock tribe considered the pipeline and its intended crossing of the Missouri River to constitute a threat to the region's clean water and to ancient burial grounds. In April 2016, Standing Rock Sioux elder LaDonna Brave Bull Allard established a camp as a center for cultural preservation and spiritual resistance to the pipeline; over the summer the camp grew to thousands of people.  The protests drew considerable  national and international attention and have been said to be "reshaping the national conversation for any environmental project that would cross the Native American land."[5] The U.S. Army Corps of Engineers had conducted a limited review of the route and found no sign
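
The windowing step can be illustrated on a short string. This is a sketch only: the sample text is invented, the window sizes are shrunk to 20 and 10 characters so the result is easy to read, and `quotes_with_context` is a hypothetical helper name.

```python
import re

def quotes_with_context(text, before=20, after=10):
    """Return (quote, context) pairs for every double-quoted span in text."""
    pairs = []
    for m in re.finditer(r'"(.*?)"', text):
        start = max(0, m.start(1) - before)  # never index before the string start
        end = m.end(1) + after               # slicing past the end is safe in Python
        pairs.append((m.group(1), text[start:end]))
    return pairs

sample = 'The elder said "water is life" at the camp. A deputy replied "move back" later.'
pairs = quotes_with_context(sample)
for quote, context in pairs:
    print(repr(quote), "->", repr(context))
```

Using the match position rather than searching for the quote text again means two identical quotations in different places each keep their own surrounding context.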


### Manual Cleaning
The initial list of 1101 quotes was manually cleaned by removing noise and lines that were obviously not spoken quotations. Duplicates were also removed. The result was a list of 660 quotes.

In [5]:
import pandas
colnames = ['QUOTE', 'REMOVE']
data = pandas.read_csv("C:\\Users\\Craig\\Documents\\GitHub\\multilevel_corpus\\Analysis_2\\remove.csv", names=colnames)
quotes = data.QUOTE.tolist()
remove = data.REMOVE.tolist()

cleanQuotes = []
totalRemoved = 0

# Pair each quote with its own flag so that repeated quotes are handled row by row
for q, flag in zip(quotes, remove):
    if flag != "X":
        if q not in cleanQuotes:
            cleanQuotes.append(q)
    else:
        totalRemoved = totalRemoved + 1

print("Original number: " + str(len(quotes)))
print("Number removed: " + str(totalRemoved))
print("Final list (deduped): " +str(len(cleanQuotes)))


Original number: 988
Number removed: 255
Final list (deduped): 660
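
The filtering logic amounts to dropping flagged rows and then deduplicating while keeping the original order. A minimal sketch with invented rows (the real data are read from `remove.csv`):

```python
# Invented (quote, flag) rows; "X" marks a row to discard
rows = [
    ("water is life", ""),
    ("click here", "X"),
    ("water is life", ""),
    ("we are protectors", ""),
]

kept = [quote for quote, flag in rows if flag != "X"]
removed = len(rows) - len(kept)
# dict.fromkeys keeps the first occurrence of each quote, in order
clean = list(dict.fromkeys(kept))

print("Original number:", len(rows))
print("Number removed:", removed)
print("Final list (deduped):", len(clean))
```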


### Grouping
The 660 quotes were then further reduced manually and qualitatively. Similar quotes (i.e., those with similar themes or speakers) were removed, as were most very short or one-word quotes. The 100 or so remaining quotes were then each assigned to one of three groups:

* Group A: proponents who either actively voiced support for the pipeline (e.g., company representatives) or took a legal or institutional stand against the pipeline protesters (e.g., law enforcement)
* Group B: protesting opponents of the pipeline, most notably the affected Indigenous peoples, but also others who came to Standing Rock, North Dakota to voice opposition
* Group C: supporters and allies of protesters, such as NGOs and politicians who spoke out against the pipeline/in support of protesters
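
One way to hold such a three-way split in code is a dictionary keyed by group label. The sketch below uses invented quotes and speakers, and it is not the structure used in the cells that follow.

```python
from collections import defaultdict

# Invented (quote, speaker, group) records for illustration
records = [
    ("this is a public safety issue", "sheriff's department", "A"),
    ("we are here to protect the water", "camp resident", "B"),
    ("we stand with the tribe", "NGO spokesperson", "C"),
    ("the water is sacred", "tribal elder", "B"),
]

by_group = defaultdict(list)
for quote, speaker, group in records:
    by_group[group].append(quote)

for label in sorted(by_group):
    print(label, len(by_group[label]))
```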

In [6]:
colnames = ['Quote', 'Speaker','Group']

data = pandas.read_csv("C:\\Users\\Craig\\Documents\\GitHub\\multilevel_corpus\\Analysis_2\\grouped.csv", names=colnames)

quotes = data.Quote.tolist()
speakers = data.Speaker.tolist()
groups = data.Group.tolist()

print("total quotes: " +str(len(quotes)-1))

data.head() 
  

total quotes: 92


| | Quote | Speaker | Group |
|---|---|---|---|
| 0 | Quote | Speaker | Group |
| 1 | Protesters' escalated unlawful behavior this w... | Morton County Sheriff's Department | A |
| 2 | ...damage caused after protesters set numerous... | Morton County Sheriff's Department | A |
| 3 | [The police said the protesters had been] very... | Morton County Sheriff's Department | A |
| 4 | ...multiple archaeological studies conducted w... | Kelcy Warren, CEO of Energy Transfer Partners | A |


### Keywords

In [8]:
groupA = []
groupB = []
groupC = []

groupAt = []
groupBt = []
groupCt = []

for i in range(0,len(quotes)):
    if groups[i]=="A":
        groupA.append(quotes[i])
    if groups[i]=="B":
        groupB.append(quotes[i])
    if groups[i]=="C":
        groupC.append(quotes[i])

def process(group, target):
    """Append the normalized quotes to `group` and the raw token lists to `target`."""
    post = []    # lemmatized, stopword-free strings (one per quote)
    postT = []   # token lists with stopwords kept (used for the pronoun counts)
    lem = WordNetLemmatizer()

    for quote in group:

        #Remove punctuation
        text = re.sub('[^a-zA-Z]', ' ', quote)

        #Convert to lowercase
        text = text.lower()

        #remove tags
        text = re.sub("</?.*?>", " <> ", text)

        # remove special characters and digits
        text = re.sub("(\\d|\\W)+", " ", text)

        ##Convert to list from string
        tokens = text.split()
        postT.append(tokens)

        #Lemmatisation, skipping stopwords
        tokens = [lem.lemmatize(word) for word in tokens if word not in stop_words]
        post.append(" ".join(tokens))

    # The processed list is stored as the last element of `group`,
    # so it is read back later as group[-1] (and as target[0])
    group.append(post)
    target.append(postT)

process(groupA,groupAt)
process(groupB,groupBt)
process(groupC,groupCt)

from sklearn.feature_extraction.text import CountVectorizer

# Pass stop_words as a list; recent scikit-learn versions reject a set
cv=CountVectorizer(max_df=0.8,stop_words=list(stop_words), max_features=10000, ngram_range=(1,3))
X=cv.fit_transform(groupA[-1])

list(cv.vocabulary_.keys())[:10]
 
#Most frequently occurring words

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in      
                   vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                       reverse=True)
    return words_freq[:n]

#Convert most freq words to dataframe for plotting bar plot

top_wordsA = get_top_n_words(groupA[-1], n=20)
top_wordsB = get_top_n_words(groupB[-1], n=20)
top_wordsC = get_top_n_words(groupC[-1], n=20)
top_dfA = pandas.DataFrame(top_wordsA)
top_dfA.columns=["Word", "Freq"]
top_dfB = pandas.DataFrame(top_wordsB)
top_dfB.columns=["Word", "Freq"]
top_dfC = pandas.DataFrame(top_wordsC)
top_dfC.columns=["Word", "Freq"]

print("\n Group A \n")

print(top_dfA)

print("\n Group B \n")

print(top_dfB)

print("\n Group C \n")
print(top_dfC)


 Group A 

          Word  Freq
0     pipeline     5
1    protester     4
2       energy     4
3          law     4
4        state     3
5     transfer     3
6      partner     3
7      federal     3
8       people     3
9        think     3
10      others     3
11     company     3
12    behavior     2
13      safety     2
14      caused     2
15      police     2
16        said     2
17  aggressive     2
18       would     2
19      cannot     2

 Group B 

          Word  Freq
0       people     9
1       nation     6
2         iowa     6
3   indigenous     5
4       dakota     5
5   government     5
6        right     5
7        water     4
8      project     4
9        going     4
10        land     4
11      trying     3
12       would     3
13         say     3
14    industry     3
15         get     3
16         far     3
17        pipe     3
18       force     3
19         use     3

 Group C 

          Word  Freq
0       people    13
1        going    10
2         camp     
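
A comparable frequency list can be produced with the standard library alone. The sketch below is an alternative to the `CountVectorizer` approach, not a reproduction of it: `top_n_words` is a hypothetical helper, the input strings are invented, and tokens are split on whitespace, whereas `CountVectorizer`'s default tokenizer also ignores single-character tokens.

```python
from collections import Counter

def top_n_words(docs, n=3):
    """Count whitespace-separated tokens across docs and return the n most common."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return counts.most_common(n)

docs = [
    "people protect water",
    "water belongs people",
    "people gather camp",
]
print(top_n_words(docs, n=2))
```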

In [7]:
concord = []
text_list = textPre.split()

def getConcord(targTerm, c1):
    for i in range(0,len(text_list)):
        if targTerm in text_list[i]:
            # Clamp window starts at 0 so negative indices do not wrap around
            snippet = " ".join(text_list[max(0, i-25):i+25])
            loc = snippet.index(targTerm)
            line = snippet[max(0, loc-35):loc+42]
            if line not in c1:
                c1.append(line)

getConcord("energy", concord)

from collections import Counter

lst = []

for i in concord:
    if "transfer" not in i:
        sp = i.split()
        # Count the word that follows each occurrence of "energy"
        for word, following in zip(sp, sp[1:]):
            if word == "energy":
                lst.append(following)
                
result = Counter(lst)
print(result)

Counter({'development': 13, 'and': 9, 'independence': 7, 'infrastructure': 5, 'projects': 5, 'products': 4, 'model': 4, 'that': 3, 'secretary': 3, 'board': 3, 'partners': 3, 'supply': 3, 'hide': 2, 'regulatory': 2, 'consumption': 2, 'it': 2, 'which': 2, 'in': 2, 'project': 2, 'energy': 2, 'sources': 2, 'on': 2, 'economy': 2, 'an': 2, 'security': 1, 'she': 1, 'trailers': 1, 'the': 1, 'sector': 1, 'firms': 1, 'boom': 1, 'trump': 1, 'foundation': 1, 'colonial': 1, 'information': 1, 'companies': 1, 'away': 1, 'to': 1, 'received': 1, 'even': 1, 'production': 1, 'generation': 1, 'resource': 1, 'indigenous': 1, 'see': 1, 'access': 1, 'those': 1, 'economics': 1, 'giant': 1, 'presidency': 1, 'is': 1, 'protesters': 1, 'industry': 1, 'issues': 1, 'corridor': 1, 'systems': 1, 'commission': 1, 'businesses': 1, 'resources': 1, 'financing': 1, 'additional': 1, 'industries': 1, 'will': 1, 'td': 1, 'campaigner': 1, 'he': 1, 'benefits': 1, 'but': 1, 'of': 1, 'independent': 1, 'here': 1, 'renaissance': 1
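
The same idea, counting the word that immediately follows a target term, can be written at the token level. A sketch with an invented sentence; `right_collocates` is a hypothetical name.

```python
from collections import Counter

def right_collocates(tokens, target):
    """Count the token that directly follows each occurrence of target."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

tokens = "energy development and energy independence drive energy development".split()
result = right_collocates(tokens, "energy")
print(result)
```

Pairing each token with its successor avoids looking up the position of the target a second time, so every occurrence contributes its own following word.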

### Pronouns

In [8]:
with open("C:\\Users\\Craig\\Documents\\GitHub\\multilevel_corpus\\Analysis_2\\pronouns.txt") as txt_pronouns:
    # One pronoun per line; strip() drops the newline without cutting the last entry
    pronouns = [line.strip() for line in txt_pronouns]
#print(pronouns)

personPl = ["we","us","ours","our","ourselves"]

import itertools

def proMatch(group):
    # group[0] holds one token list per quote; flatten to a single token list
    tokens = list(itertools.chain.from_iterable(group[0]))
    length = len(tokens)
    # Occurrences of first-person plural pronouns and of all listed pronouns
    countPersonPl = sum(tokens.count(p) for p in personPl)
    countTot = sum(tokens.count(p) for p in pronouns)
    print("All: " + str(round(countTot/length,2)))
    print("Personal plural: " + str(round(countPersonPl/length,2)) + "\n")
    

print("Group A")
proMatch(groupAt)

print("Group B")
proMatch(groupBt)

print("Group C")
proMatch(groupCt)

Group A
All: 0.05
Personal plural: 0.02

Group B
All: 0.11
Personal plural: 0.06

Group C
All: 0.12
Personal plural: 0.06
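
The ratios above are pronoun tokens divided by all tokens in a group. A toy version with an invented token list and a shortened, hypothetical pronoun list:

```python
tokens = "we will protect our water and they know it".split()

all_pronouns = {"we", "our", "they", "it", "us", "i", "you"}    # hypothetical short list
first_person_plural = {"we", "us", "ours", "our", "ourselves"}  # same items as personPl

share_all = sum(tok in all_pronouns for tok in tokens) / len(tokens)
share_pl = sum(tok in first_person_plural for tok in tokens) / len(tokens)
print("All:", round(share_all, 2))
print("Personal plural:", round(share_pl, 2))
```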



In [9]:
def getQuote(targTerm, group):
    c1 = []
    group = group[0]
    group = list(itertools.chain.from_iterable(group))
    #print(group)
    for i in range(0,len(group)):
        if targTerm in group[i]:
            snippet = " ".join(group)
            #print(snippet)

    
getQuote("protestor", groupAt)
getQuote("protester", groupAt)
#getConcord("economy", groupCt)


In [10]:
for i in range(0, len(groups)):
    if groups[i]=="A":
        print(quotes[i], speakers[i])
        print("\n")

Protesters' escalated unlawful behavior this weekend by setting up illegal roadblocks, trespassing onto private property…this is a public safety issue.     Morton County Sheriff's Department


...damage caused after protesters set numerous fires.   Morton County Sheriff's Department


[The police said the protesters had been] very aggressive   Morton County Sheriff's Department


...multiple archaeological studies conducted with state historic preservation offices found no sacred items along the route   Kelcy Warren, CEO of Energy Transfer Partners


...political interference…further delay in the consideration of this case would add millions of dollars more each month in costs which cannot be recovered.   Energy Transfer Partners


...will only prolong the disruption in the region caused by protests and make life difficult for everyone who lives and works in the area.   North Dakota Senator John Hoeven


[Energy Transfer Partners alleges Greenpeace and other] eco-terrorist groups [trie

In [11]:
from nltk import word_tokenize

c1 = []

def getAdj(targTerm, group):
    for line in group:
        if targTerm in line:
            pos = nltk.pos_tag(word_tokenize(line))
            # Walk the tagged tokens by position so that repeated tokens
            # and a match at position 0 are handled correctly
            for ind in range(1, len(pos)):
                if targTerm in pos[ind][0] and pos[ind-1][1]=='JJ':
                    c1.append(pos[ind-1][0])

           
getAdj("protester", corpus_PRE)
getAdj("protestor", corpus_PRE)
resultc1 = Counter(c1)
print(resultc1)

Counter({'peaceful': 17, 'american': 13, 'other': 8, 'indigenous': 8, 'august': 8, 'unruly': 6, 'november': 3, 'nodapl': 3, 'alive': 3, 'native': 3, 'many': 2, 'unarmed': 2, 'allied': 2, 'familiar': 1, 'sunday': 1, 'hundred': 1, 'dapl': 1, 'tribal': 1, 'april': 1, 'future': 1, 'saturday': 1, 'early': 1, 'text': 1})
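
The adjacency check itself does not depend on the tagger. The sketch below applies it to a hand-tagged token list (the tags are written by hand for illustration, not produced by NLTK), and `adjectives_before` is a hypothetical helper.

```python
from collections import Counter

def adjectives_before(tagged, target):
    """Count tokens tagged 'JJ' that directly precede a token containing target."""
    found = Counter()
    for (prev_tok, prev_tag), (tok, _tag) in zip(tagged, tagged[1:]):
        if target in tok and prev_tag == "JJ":
            found[prev_tok] += 1
    return found

# Hand-written (token, tag) pairs in Penn Treebank style
tagged = [
    ("peaceful", "JJ"), ("protesters", "NNS"), ("met", "VBD"),
    ("unarmed", "JJ"), ("protesters", "NNS"), ("and", "CC"),
    ("police", "NNS"), ("protester", "NN"),
]
adjectives = adjectives_before(tagged, "protester")
print(adjectives)
```

Only tokens tagged `JJ` count, so "police" (tagged as a noun here) is not recorded even though it precedes "protester".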
