Tweet text data parsing/cleaning for nlp #25

wwymak · 2017-02-02T00:11:03Z

Look through the data available at https://data.world/data4democracy/far-right, as well as the data from the discursive project.

Some of the tasks we might do are:

Stem
Tokenize
Remove stop words (list of stop words: https://pypi.python.org/pypi/stop-words; most NLP libraries also ship with their own)
Tag POS (for further sentiment analysis)

Depending on what you want to achieve, you might not need all of the above. For training word2vec, for example, you might not need any of it, but you might want to convert emojis.
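
For the emoji conversion mentioned above, here is a minimal sketch using only the standard library. It assumes that "convert" means replacing each emoji with a text token built from its Unicode name; the function name `demojize` and the `:name:` token format are my own choices for illustration, not something the project has settled on.

```python
import unicodedata


def demojize(text):
    """Replace symbol characters (emoji and similar) with :unicode_name: tokens."""
    out = []
    for ch in text:
        # Most emoji fall in the Unicode category "So" (Symbol, other).
        # Note this also catches non-emoji symbols such as the copyright sign.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                out.append(" :" + name.lower().replace(" ", "_") + ": ")
                continue
        elif ch == "\ufe0f":
            # Drop the emoji variation selector, which carries no meaning as text.
            continue
        out.append(ch)
    # Collapse the extra whitespace introduced around the tokens.
    return " ".join("".join(out).split())


print(demojize("so happy \U0001F600"))  # so happy :grinning_face:
```

Skin-tone modifiers and zero-width joiners are left untouched here, so multi-part emoji would need extra handling.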

Useful libraries:
spaCy
NLTK
sklearn
TextBlob
gensim
Mallet
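
Before committing to one of the libraries above, the tokenize / remove stop words / stem steps can be sketched with the standard library alone. This is only a rough illustration: `STOP_WORDS` is a tiny hand-picked subset, `naive_stem` is a crude suffix stripper standing in for a real stemmer (e.g. Porter), and dropping URLs while keeping @mentions and #hashtags intact is an assumption about what we want. POS tagging is left out, since it really does need one of the libraries.

```python
import re

# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "in", "on", "for", "it", "this", "that"}

# Match URLs first, then @mentions / #hashtags, then plain words.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?")


def tokenize(tweet):
    """Lowercase the tweet and split it into URL, mention, hashtag and word tokens."""
    return TOKEN_RE.findall(tweet.lower())


def naive_stem(token):
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def clean_tweet(tweet):
    """Tokenize, drop stop words and URLs, then stem the ordinary words."""
    tokens = tokenize(tweet)
    kept = [t for t in tokens
            if t not in STOP_WORDS and not t.startswith("http")]
    # Mentions and hashtags are kept verbatim rather than stemmed.
    return [t if t[0] in "@#" else naive_stem(t) for t in kept]


print(clean_tweet("The voters are marching to #Berlin https://t.co/abc"))
# ['voter', 'march', '#berlin']
```

Swapping `naive_stem` for a library stemmer and `STOP_WORDS` for a full list should not change the overall shape of the function.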

I'm exploring what is possible/needed at the moment with @divya, but feel free to chip in with opinions and ideas, especially if you're an NLP expert :)

jss367 · 2017-02-25T18:03:01Z

I saw the "help wanted" tag on this so I built a notebook that inputs tweets, then tokenizes, removes stop words, and stems the tweets. It's called CleanText.ipynb if you want to take a look at it. I'd be happy to make changes or additions if you have any suggestions.

wwymak added help wanted status-in-progress labels Feb 2, 2017

wwymak self-assigned this Feb 2, 2017
