Natural_Language_Processing_Tweets

This project applies Natural Language Processing techniques to Twitter data.

Text documents, such as crawled web data, are usually composed of topically coherent segments; within each segment, word usage tends to follow a more consistent lexical distribution than it does across the data set as a whole. A linear partition of text into topic segments can be used for text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, recommender systems, and learning-to-rank methods.

To perform such text analysis tasks, this project first extracts the text data from tweets saved in XML format (a minimal sketch of this extraction step follows the numbered list below). Second, a set of text pre-processing steps is carried out, including:

  1. Generate the corpus vocabulary, following the same structure as sample_vocab.txt. Note that the vocabulary must be sorted alphabetically.

  2. For each day (i.e., each sheet in the Excel file), calculate the top 100 most frequent unigrams and the top 100 most frequent bigrams, following the structure of sample_100uni.txt and sample_100bi.txt. If a particular day has fewer than 100 bigrams, include only the top-n bigrams for that day (n < 100).

  3. Generate the sparse representation (i.e., the document-term matrix) of the Excel file, following the structure of sample_countVec.txt.
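
The extraction step mentioned above depends on the layout of the XML files, which is not shown here. The following is a minimal sketch, assuming each tweet is stored in a `<tweet>` element with an `id` attribute and the tweet text as its body; the function name `extract_tweets`, the element name, and the attribute name are illustrative assumptions rather than the repository's actual schema.

```python
import xml.etree.ElementTree as ET

def extract_tweets(xml_path):
    """Return a list of (tweet_id, text) pairs parsed from one XML file."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    tweets = []
    for node in root.iter("tweet"):      # assumed element name
        tweet_id = node.get("id", "")    # assumed attribute name
        text = (node.text or "").strip()
        tweets.append((tweet_id, text))
    return tweets

# Example usage (the file name is illustrative):
# for tweet_id, text in extract_tweets("tweets.xml")[:5]:
#     print(tweet_id, text)
```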

The output of this processing is saved in the following files (a sketch of how they could be produced appears after the list):

  • vocab.txt
  • countVec.txt
  • 100uni.txt
  • 100bi.txt
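
The sketch below shows one way the four output files could be produced. It assumes the tweets have already been grouped by day into a dict `{day_label: [tweet_text, ...]}`, that simple lowercased whitespace tokenisation is acceptable, and that a recent scikit-learn is available; the function name `preprocess` and the exact line formats are assumptions, since the authoritative formats are defined by the repository's sample_* files.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(days):
    """days: {day_label: [tweet_text, ...]} -- one document per day."""
    docs = [" ".join(texts) for texts in days.values()]
    vectorizer = CountVectorizer(lowercase=True)
    matrix = vectorizer.fit_transform(docs)        # sparse document-term matrix
    vocab = vectorizer.get_feature_names_out()     # already sorted alphabetically

    # vocab.txt: one "word:index" entry per line, alphabetically sorted.
    with open("vocab.txt", "w") as f:
        for idx, word in enumerate(vocab):
            f.write(f"{word}:{idx}\n")

    # countVec.txt: one line per day listing the non-zero "word_index:count" pairs.
    with open("countVec.txt", "w") as f:
        for day, row in zip(days, matrix.toarray()):
            pairs = ",".join(f"{i}:{c}" for i, c in enumerate(row) if c > 0)
            f.write(f"{day},{pairs}\n")

    # 100uni.txt / 100bi.txt: per-day top-100 unigrams and bigrams
    # (most_common(100) simply returns fewer pairs when fewer than 100 exist).
    with open("100uni.txt", "w") as uni, open("100bi.txt", "w") as bi:
        for day, texts in days.items():
            tokens = " ".join(texts).lower().split()
            bigrams = list(zip(tokens, tokens[1:]))
            uni_top = ", ".join(f"{w}:{c}" for w, c in Counter(tokens).most_common(100))
            bi_top = ", ".join(f"{a}_{b}:{c}" for (a, b), c in Counter(bigrams).most_common(100))
            uni.write(f"{day}: {uni_top}\n")
            bi.write(f"{day}: {bi_top}\n")

# Example usage with toy data:
# preprocess({"Day1": ["hello twitter world", "hello nlp"]})
```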
