### Description:
In this notebook, my goal is to index source files and source sentence for each word. <br>

In **Stage 1: Import data**:<br>
I just import processed data I did in notebook **EDA_1_...** which has 4 columns: **FileName**, **Sentence**, 
**Clean sentence**, **Lemmatized**.<br>

In **Stage 2: Indexing**:<br>
I get count of unique tokens all of six files as in previous notebook **EDA_3_...**, <br>
but this time I also index/store source files names and source sentences (i.e sentences where the word was is found).<br>
Final results are in **findings_df** dataframe whic has 3 columns: **Token_and_count**, **FileNames**, **Sentences**.<br>
 <br>
Column **Token_and_count** has unique word and it's count <br>
Column **FileNames** has file names the word is found <br>
Column **Sentences** has sentences the word is found <br>

In [2]:
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from gensim.parsing.preprocessing import STOPWORDS
from wordcloud import WordCloud, ImageColorGenerator  #STOPWORDS

### Stage 1: Import processed data

In [3]:
# 1
# Import data
df = pd.read_csv("Processed_data.csv")

# 2
# Remove rows containing NaN
df = df[df['Clean sentence'].notnull()]

# 3
# Check data
print("DF shape:", df.shape)
df.head()

DF shape: (935, 4)


Unnamed: 0,FileName,Sentence,Clean sentence,Lemmatized
0,doc1.txt,Let me begin by saying thanks to all you who'v...,let me begin thanks traveled far wide brave co...,let i begin thank travel far wide brave cold t...
1,doc1.txt,We all made this journey for a reason.,journey reason,journey reason
2,doc1.txt,"It's humbling, but in my heart I know you didn...",humbling heart i know come me came believe cou...,humble heart i know come i come believe country
3,doc1.txt,"In the face of war, you believe there can be p...",face war believe peace,face war believe peace
4,doc1.txt,"In the face of despair, you believe there can ...",face despair believe hope,face despair believe hope


### Stage 2: Indexing

### Get tokens ferquencies across all files

In [4]:
proc_tokens = []
for s in df["Clean sentence"].tolist():
    proc_tokens.extend(s.split())

In [5]:
# 5
# Sort token counts in desc order
term2freq = Counter(proc_tokens).most_common()

### Token and it's source files and sentences

In [6]:
# Create empty dataframe
columns = ['Token_and_count', 'FileNames', 'Sentences']
findings_df = pd.DataFrame(columns=columns)

for t,c in term2freq[:10]:
# for t,c in [("people", 1)]:
    df = df[df["Clean sentence"].str.contains(t)]
    
    findings_df = findings_df.append({'Token_and_count': " ".join([t,str(c)]), \
                                      'FileNames': df.FileName.tolist(), \
                                      'Sentences': df.Sentence.tolist()}, ignore_index=True)

In [7]:
findings_df.head()

Unnamed: 0,Token_and_count,FileNames,Sentences
0,i 247,"[doc1.txt, doc1.txt, doc1.txt, doc1.txt, doc1....",[Let me begin by saying thanks to all you who'...
1,people 68,"[doc1.txt, doc1.txt, doc1.txt, doc1.txt, doc1....",[In the face of a politics that's shut you out...
2,iraq 64,"[doc5.txt, doc5.txt, doc5.txt]","[A few Tuesdays ago, the American people embra..."
3,country 60,[doc5.txt],"[Today, the Iraqi landscape is littered with i..."
4,time 60,[],[]


In [8]:
w = "people"

for row in findings_df.iterrows():
    if w in row[1]["Token_and_count"]:
#         pass
        print(row[1]["Token_and_count"])
        print(row[1]["FileNames"])
        print(row[1]["Sentences"])

people 68
['doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc1.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc2.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc3.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc4.txt', 'doc5.txt', 'doc5.txt', 'doc5.txt', 'doc5.txt', 'doc5.txt', 'doc5.txt', 'doc5.txt', 'doc5.txt', 'doc5.txt', 'doc6.txt', 'doc6.txt']
["In the face of a politics that's shut you out, that's told you to settle, that's divided us for too long, you believe we can be one people, reaching for what's possible, building that more perfect union.", '