# Analysis 3
## Non-verbal Corpus of Mining Projects

This is the third and final analysis of corpus linguistic data related to ecological themes. This corpus consists of audio and video representing different perspectives on mining and natural resource development. These include recorded interviews, documentaries, and recordings of 'town hall' type meetings. The data was collected manually using a search engine. Transcripts were then obtained for each media item and saved as separate *.txt* files.

### Read Data 

There were 25 *.txt* files in total, each with a URL linking back to the original audio/visual media. The transcripts included timestamps (e.g. 05:45). By looping through the transcripts and reading the last timestamp in each, we can obtain the total runtime of the media.

Running this block shows that the corpus of transcripts is about 88,000 words and the total media runtime is about 7 hours 45 minutes.

In [2]:
import glob   
path = 'C:\\Users\\Craig\\Documents\\GitHub\\multilevel_corpus\\Analysis_3\\corpus\\'   

# initialize variables
total=0
files=0
times = []
names=[]
all = ""

# loop through files and set variable values
txt_files = glob.glob(path+"*.txt")
for filename in txt_files:
    files+=1
    with open(filename, "r", encoding="utf-8") as f:
        name = filename.replace(path,"")
        x = f.readlines()
        xtimes=[]
        for i in x[-3:]:
            if ":" in i and ": " not in i:
                xtimes.append(i)
        for j in x:
            all=all+ " " + j
        times.append(xtimes[-1].strip())

# convert to time     
minutes = 0
seconds = 0

for time in times:
    j=time.split(':')
    minutes+=int(j[0])
    seconds+=int(j[1])
    
total = (minutes+seconds/60)/60  
average = int((total/files)*60)
f = int(total)
frac = (total-f) * 60

# word count of transcripts
words = all.split(' ')
wordCount = len(words)
numWords = "{:,}".format(wordCount)

# print results
print("number of files: " + "\t" + str(files))
print("word count: " + "\t\t" + str(numWords))
print("total runtime: " + "\t\t" + str(f) + " hours " + str(int(frac)) + " minutes ")   
print("average runtime: " + "\t" + str(average) + " minutes")


number of files: 	25
word count: 		88,140
total runtime: 		7 hours 46 minutes 
average runtime: 	18 minutes
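The runtime arithmetic can also be written as a small standalone helper. The sketch below is illustrative only: the function names and the sample timestamps are assumptions, not part of the notebook, and it additionally accepts an `HH:MM:SS` form in case a transcript uses one.

```python
def timestamp_to_seconds(stamp):
    """Convert 'MM:SS' or 'HH:MM:SS' to a number of seconds."""
    seconds = 0
    for part in stamp.strip().split(":"):
        # each colon-separated field is worth 60x the one after it
        seconds = seconds * 60 + int(part)
    return seconds


def format_runtime(total_seconds):
    """Render a duration in the same 'H hours M minutes' style as above."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes = remainder // 60
    return f"{hours} hours {minutes} minutes"


# hypothetical final timestamps from three transcripts
last_stamps = ["05:45", "1:02:30", "18:10"]
total_seconds = sum(timestamp_to_seconds(s) for s in last_stamps)
print(format_runtime(total_seconds))
```

Working in whole seconds avoids the floating-point hour fraction used in the cell above, so the minute count cannot be off by one through rounding.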


### Pre-processing

The code below preprocesses the text in two steps:

1. **Noise removal** (removal of hyperlinks, punctuation, special characters, digits)
2. **Normalization** (lemmatization, removal of stopwords; a stemmer is instantiated but not applied) 

An excerpt from the corpus after noise removal, before normalization, is then printed.

In [3]:
import re
import nltk

#nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

#nltk.download('wordnet') 

from nltk.stem.wordnet import WordNetLemmatizer

##Creating a list of stop words

stop_words = set(stopwords.words("english"))

corpus_PRE = []
corpus = []

#Remove hyperlinks 

text = re.sub(r'http\S+', '', all)

#Remove punctuation
text = re.sub('[^a-zA-Z]', ' ', text)

#Convert to lowercase
text = text.lower()

#remove tags
text=re.sub("</?.*?>"," <> ",text)

# remove special characters and digits
text=re.sub("(\\d|\\W)+"," ",text)

corpus_PRE.append(text)

##Convert to list from string
text = text.split()

##Stemming
ps=PorterStemmer()

#Lemmatisation
lem = WordNetLemmatizer()
text = [lem.lemmatize(word) for word in text
        if word not in stop_words]
text = " ".join(text)
corpus.append(text)
    
p = corpus_PRE[0].split(' ')[2210:2280]
print(" ".join(p))

opencast mine so we cannot make it a little bit darker better for this for the slides okay i will show you more or details of these technique and but this okay this is a quite nice picture showing you the situation here north of caucus yeah things just completely down completely not good better ok this is the conveyor bridge which is a quite unique technique developed here analyzation
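The noise-removal and stopword steps can be tried on a single sentence without downloading any NLTK data. This is a minimal sketch under stated assumptions: `DEMO_STOPWORDS` and `preprocess` are hypothetical names, the stopword list is a tiny stand-in for NLTK's English list, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword list (an assumption; the notebook uses NLTK's list)
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "this"}


def preprocess(raw, stopwords=DEMO_STOPWORDS):
    """Strip hyperlinks and non-letters, lowercase, and drop stopwords."""
    text = re.sub(r"http\S+", " ", raw)       # remove hyperlinks first
    text = re.sub(r"[^a-zA-Z]", " ", text)    # keep letters only
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in stopwords]


sample = "05:45 This is the conveyor bridge, see https://example.com/clip for details."
print(preprocess(sample))
```

Removing hyperlinks before punctuation matters: once the colons and slashes are gone, the `http\S+` pattern no longer sees a single unbroken URL.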


### Corpus Contents

To get an overview of the contents of the corpus, we output the titles and hyperlinks from a random sample.

In [4]:
titles = []
links = []

for filename in txt_files:
    with open(filename, "r", encoding="utf-8") as f:
        x = f.readlines()
        titles.append(x[0])
        links.append(x[1])

        
print("Random sample of media:" + "\n")

import random
# pick a start index that leaves room for three consecutive items
ind = random.randint(0, len(titles)-3)

for i in range(ind,ind+3):
    print(titles[i] + links[i])

Random sample of media:

Broadening the debate on platinum mining sector
https://www.youtube.com/watch?v=lb9SaO3PTOU&t=60s

Filmmaker Nettie Wild finds cinematic poetry in 'polarized' mining debate
https://www.cbc.ca/radio/thecurrent/the-current-for-may-6-2016-1.3569650/filmmaker-nettie-wild-finds-cinematic-poetry-in-polarized-mining-debate-1.3569733

Polymet Mining Issue -8th Congressional Debate
https://www.youtube.com/watch?v=J1xcp_AhWdo
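The cell above prints three consecutive files starting at a random index. If a sample spread across the whole corpus is preferred, `random.sample` draws distinct items without replacement. The sketch below is one possible variant: `sample_media` and the demo titles and links are hypothetical, and it assumes the same title-then-link layout in each transcript.

```python
import random


def sample_media(titles, links, k=3, seed=None):
    """Return k distinct (title, link) pairs chosen at random."""
    rng = random.Random(seed)          # a seed makes the sample repeatable
    k = min(k, len(titles))            # never ask for more items than exist
    picks = rng.sample(range(len(titles)), k)
    return [(titles[i].strip(), links[i].strip()) for i in picks]


# hypothetical stand-ins for the first two lines of each transcript
demo_titles = ["Title A\n", "Title B\n", "Title C\n", "Title D\n"]
demo_links = ["https://example.com/a\n", "https://example.com/b\n",
              "https://example.com/c\n", "https://example.com/d\n"]

for title, link in sample_media(demo_titles, demo_links, k=3, seed=1):
    print(title)
    print(link + "\n")
```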



### Keywords

In [5]:
# code adapted from https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34

t=corpus[0].split(' ')

from sklearn.feature_extraction.text import CountVectorizer
import re
# pass the stop words as a list; recent scikit-learn versions reject a set
cv=CountVectorizer(max_df=0.8,stop_words=list(stop_words), max_features=10000, ngram_range=(1,3))
X=cv.fit_transform(t)

list(cv.vocabulary_.keys())[:10]

import pandas

#Most frequently occurring words

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in      
                   vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], 
                       reverse=True)
    return words_freq[:n]

#Convert most freq words to dataframe for plotting bar plot

top_words = get_top_n_words(t, n=20)
top_df = pandas.DataFrame(top_words)
top_df.columns=["Word", "Freq"]

print(top_df)

#Barplot of most freq words

import seaborn as sns
sns.set(rc={'figure.figsize':(13,8)})
g = sns.barplot(x="Word", y="Freq", data=top_df,  palette="Blues_d")
g.set_xticklabels(g.get_xticklabels(), rotation=30)

       Word  Freq
0      know   517
1    people   325
2     right   271
3    mining   268
4     think   252
5       one   237
6      like   234
7      mine   226
8       say   222
9     going   220
10     year   210
11     well   203
12       go   191
13      get   190
14  company   186
15    thing   180
16      see   159
17      got   157
18     want   156
19   really   151


<Figure: bar plot of the 20 most frequent words (x: Word, y: Freq)>

In [6]:
concord = []
text_list = corpus_PRE[0].split(' ')
#print(text_list)

def getLines(target):

    for i in range(0,len(text_list)):
        if target in text_list[i]:
            snippet = " ".join(text_list[i-15:i+15])
            if target in snippet:
                loc = snippet.index(target)
                line = snippet[loc-35:loc+42]
                if line not in concord:
                    concord.append(line)

getLines("eco")
                    
print("count:" +str(len(concord))+ "\n")
print('Random sample:' ) 

if len(concord)>10:
    from random import sample
    chosen = sample(concord, 10)
    for i in chosen:
        print(i)
else:
    for i in concord:
        print(i)

count:144

Random sample:
he bottom of the sea you will lose ecosystem values the question is how to mi
sonnel you have to have the track record and i m telling you that cliffs reso
to reality uranium has been a huge economic boom to this whole area these sma
s six agonizing months to get the records we discovered there was texting thr
mportant to understand that phone records don t provide a means to detect any
projects and now so coming to the second part giving you an introduction to w
much so that saving messianic has become a cause celeb on the world conservat
sedimentation with sand so we can reconstruct the landscape that we know that
the development of the rest of the economy that s the pattern you know you kn
not going to show up on the phone records i think one of the problems in the 
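The sample above includes hits such as "records", "second" and "reconstruct", because the lookup matches "eco" anywhere inside a token. A token-based keyword-in-context helper makes that choice explicit and also clamps the window at the start of the text. This is a sketch, not the notebook's method: `concordance` and its `prefix` flag are hypothetical names.

```python
def concordance(tokens, target, window=5, prefix=False):
    """Return keyword-in-context lines for tokens matching `target`.

    With prefix=False the target may occur anywhere in a token;
    with prefix=True only tokens starting with the target count.
    """
    lines = []
    for i, token in enumerate(tokens):
        hit = token.startswith(target) if prefix else target in token
        if hit:
            # max(0, ...) keeps the left slice valid near the start of the text
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = "you will lose ecosystem values and the local economy".split()
for line in concordance(tokens, "eco", window=2):
    print(line)
```

Calling it with `prefix=True` would keep "ecosystem" and "economy" while dropping incidental matches such as "records".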


### Ecological Level

In [7]:
import cv2, pafy
from IPython.display import HTML

# Youtube
import warnings; warnings.simplefilter('ignore')
print("https://www.youtube.com/embed/-UPjsuuyvD4?start=632&end=653")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/-UPjsuuyvD4?start=632&end=653;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')


https://www.youtube.com/embed/-UPjsuuyvD4?start=632&end=653
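The same embed pattern recurs for every clip in the sections that follow, so the link and the iframe markup can be built from one set of values. The helper names below are hypothetical, and the sketch omits the extra player options used in the cells; it only relies on the `start` and `end` parameters already present in the URLs.

```python
import html


def embed_url(video_id, start, end):
    """Build an embed link that plays from `start` to `end` (in seconds)."""
    return f"https://www.youtube.com/embed/{video_id}?start={start}&end={end}"


def embed_iframe(video_id, start, end, width=400, height=315):
    """Return iframe markup for the clip, with the URL escaped for HTML."""
    src = html.escape(embed_url(video_id, start, end), quote=True)
    return (f'<iframe width="{width}" height="{height}" src="{src}" '
            'frameborder="0" allowfullscreen></iframe>')


url = embed_url("-UPjsuuyvD4", 632, 653)
print(url)
print(embed_iframe("-UPjsuuyvD4", 632, 653))
```

In a notebook the returned markup would be wrapped in `IPython.display.HTML`, as in the cell above. Deriving the printed link and the iframe from the same arguments keeps the two from drifting apart.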


In [8]:
import os 

ims1 = []

# Function to extract frames 
def FrameCapture(path,folder,ims):
    
    # strip the two-digit file name and extension (e.g. "05.mp4")
    pathnew=path[0:-6]
    
    # create the output folder; exist_ok avoids an error on re-runs
    os.makedirs(pathnew+folder, exist_ok=True)    
        
    # Path to video file 
    vidObj = cv2.VideoCapture(path) 
  
    # Used as counter variable 
    count = 0
  
    while True: 
  
        # read() returns (False, None) once no frames remain 
        success, image = vidObj.read() 
        if not success:
            break
  
        # Saves the frames with frame-count 
        k=(pathnew+folder+"\\frame%d.png" % count)
        cv2.imwrite(k, image) 
        ims.append(k)
  
        count += 1

    # release the video file handle
    vidObj.release()
  
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\ecological\\05.mp4', "e1",ims1) 
print("Number of frames: " + str(len(ims1)))

Number of frames: 0


In [9]:
from matplotlib.pyplot import figure, imshow, axis
from matplotlib.image import imread

def showImagesHorizontally(list_of_files):
    fig = figure(figsize = (40,5))
    number_of_files = len(list_of_files)
    for i in range(number_of_files):
        a=fig.add_subplot(1,number_of_files,i+1)
        image = imread(list_of_files[i])
        imshow(image,cmap='Greys_r')
        axis('off')

for i in range(0,12,4):
    showImagesHorizontally(ims1[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [10]:
# Youtube
warnings.simplefilter('ignore')
print("https://www.youtube.com/embed/Sh0_Wf8F4RM?start=857&end=888")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/Sh0_Wf8F4RM?start=857&end=888;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')


https://www.youtube.com/embed/Sh0_Wf8F4RM?start=857&end=888


In [11]:
ims2 = []

#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\ecological\\10.mp4', "e2",ims2) 
print("Number of frames: " + str(len(ims2)))

Number of frames: 0


In [12]:
for i in range(552,564,4):
    showImagesHorizontally(ims2[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [13]:
# Youtube
warnings.simplefilter('ignore')
print("https://www.youtube.com/embed/UvKe2LYy5pk?start=920&end=945")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/UvKe2LYy5pk?start=920&end=945;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')


https://www.youtube.com/embed/UvKe2LYy5pk?start=920&end=945


In [14]:
ims3 = []

#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\ecological\\12.mp4', "e3",ims3) 
print("Number of frames: " + str(len(ims3)))

Number of frames: 0


In [15]:
for i in range(258,270,4):
    showImagesHorizontally(ims3[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [16]:
# Youtube
warnings.simplefilter('ignore')
print("https://www.youtube.com/embed/vBhvFWRLiOs?start=821&end=829")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/vBhvFWRLiOs?start=821&end=829;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')


https://www.youtube.com/embed/vBhvFWRLiOs?start=821&end=829


In [18]:
ims3a = []

#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\ecological\\18.mp4', "e4",ims3a) 
print("Number of frames: " + str(len(ims3a)))

Number of frames: 0


### Cultural Level

In [21]:

import warnings; warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/z6ewpjWYfYo?start=535&end=555")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/z6ewpjWYfYo?start=535&end=555;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

In [22]:
ims4 = []

#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\cultural\\02.mp4', "e1",ims4) 
print("Number of frames: " + str(len(ims4)))

Number of frames: 0


In [23]:
for i in range(504,516,4):
    showImagesHorizontally(ims4[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [26]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/10FrfEa0Xck?start=33&end=45")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/10FrfEa0Xck?start=33&end=45;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/10FrfEa0Xck?start=33&end=45


In [27]:
ims5 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\cultural\\03.mp4', "e2",ims5) 
print("Number of frames: " + str(len(ims5)))

Number of frames: 0


In [28]:
for i in range(54,66,4):
    showImagesHorizontally(ims5[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [32]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/awnLI4pRnUM?start=42&end=52")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/awnLI4pRnUM?start=43&end=58;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/awnLI4pRnUM?start=42&end=52


In [33]:
ims6 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\cultural\\04.mp4', "e3",ims6) 
print("Number of frames: " + str(len(ims6)))

Number of frames: 0


In [34]:
for i in range(18,30,4):
    showImagesHorizontally(ims6[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [39]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/UvKe2LYy5pk?start=1198&end=1220")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/UvKe2LYy5pk?start=1198&end=1220;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/UvKe2LYy5pk?start=1198&end=1220


In [40]:
ims7 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\cultural\\14.mp4', "e4",ims7) 
print("Number of frames: " + str(len(ims7)))

Number of frames: 0


In [41]:
for i in range(422,434,4):
    showImagesHorizontally(ims7[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [45]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/vBhvFWRLiOs?start=467&end=476")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/vBhvFWRLiOs?start=467&end=476;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/vBhvFWRLiOs?start=467&end=476


In [46]:
ims8 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\cultural\\17.mp4', "e5",ims8) 
print("Number of frames: " + str(len(ims8)))

Number of frames: 0


In [47]:
for i in range(98,110,4):
    showImagesHorizontally(ims8[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

### Socio-Economic Level

In [48]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/gU7PBoy-wFE?start=10&end=21")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/gU7PBoy-wFE?start=10&end=21;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/gU7PBoy-wFE?start=10&end=21


In [49]:
ims9 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\economic\\06.mp4', "e1",ims9) 
print("Number of frames: " + str(len(ims9)))

Number of frames: 0


In [50]:
for i in range(292,304,4):
    showImagesHorizontally(ims9[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [54]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/Sh0_Wf8F4RM?start=390&end=420")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/Sh0_Wf8F4RM?start=390&end=420;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/Sh0_Wf8F4RM?start=390&end=420


In [55]:
ims10 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\economic\\11.mp4', "e2",ims10) 
print("Number of frames: " + str(len(ims10)))

Number of frames: 0


In [56]:
for i in range(352,364,4):
    showImagesHorizontally(ims10[i:i+4])

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

<Figure size 2880x360 with 0 Axes>

In [57]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/vBhvFWRLiOs?start=1299&end=1316")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/vBhvFWRLiOs?start=1299&end=1316;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/vBhvFWRLiOs?start=1299&end=1316


In [58]:
ims11 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\economic\\19.mp4', "e3",ims11) 
print("Number of frames: " + str(len(ims11)))

Number of frames: 0


In [62]:
warnings.simplefilter('ignore')
# Youtube
print("https://www.youtube.com/embed/vBhvFWRLiOs?start=58&end=75")
HTML('<iframe width="400" height="315" src="https://www.youtube.com/embed/vBhvFWRLiOs?start=58&end=75;rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

https://www.youtube.com/embed/vBhvFWRLiOs?start=58&end=75


In [None]:
ims12 = []
#FrameCapture('C:\\Users\\Craig\\Documents\\videoclips\\economic\\16.mp4', "e4",ims12) 
print("Number of frames: " + str(len(ims12)))