tweepyclean

tweepyclean is a Python package that processes data generated by the existing Tweepy package. It produces clean data frames, summarizes data, and generates new features.

Our package adds resources for users of the existing Tweepy package. Tweepy is built around Twitter's API and is used to retrieve tweet information from Twitter's servers. tweepyclean processes the raw data from Tweepy into a more understandable format by extracting and organizing the contents of a user's tweets. It is built specifically for analyzing a single user's timeline (generated using tweepy's api.user_timeline function). Users can visualize average engagement by time of day posted, see basic summary statistics of word contents and tweet sentiment, and obtain a processed dataset that can be used in a wide variety of machine learning models.

Installation

$ pip install -i https://test.pypi.org/simple/ tweepyclean

Features

The package provides the following functions:

  • raw_df(): Generates a dataframe from the tweepy.cursor.ItemIterator object returned by calling tweepy.Cursor(api.user_timeline, id=username, tweet_mode='extended').items() with the tweepy package. This retrieves a Twitter user's timeline and converts it into an easily workable dataframe with one row per tweet and the columns tweepy generates by default (e.g. tweet text, number of favorites, number of retweets).

  • clean_tweets(): Adds new columns to the dataframe generated by raw_df(), such as a link/hashtag/emoji-free version of the text column, a word count, a sentiment score, a Flesch readability score, a user-entered handle, a column containing all emojis used, and a column containing all hashtags used.

  • engagement_by_hour(): Generates a line chart from the cleaned or raw dataframe showing the favorites/retweets received by hour of posting.

  • tweet_words(): Returns the most frequently used words in the user's timeline, with their counts, from the cleaned dataframe (needs the text_only column generated by clean_tweets()).

  • sentiment_total(): Generates a line chart from the cleaned dataframe (needs the text_only column generated by clean_tweets()) of the number of tweeted words associated with particular emotional sentiments.

Dependencies

Python 3 or greater

Python package    Version
altair            ^4.1.0
nltk              ^3.5
textstat          ^0.7.0
emoji             ^1.2.0
tweepy            ^3.10.0
pytest            ^6.2.2

Usage

Functions

raw_df(tweets): Creates a dataframe from a tweepy.cursor.ItemIterator object. The dataframe has labeled columns for the id, created_at, full_text, favorite_count, retweet_count, retweeted, entities, in_reply_to_user_id, and source fields of each tweet in the iterator.
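As a rough illustration of the idea behind raw_df(), the sketch below copies the listed fields from status-like objects into one dictionary per tweet. It is not the package's implementation: statuses_to_rows and fake_status are hypothetical names, and a SimpleNamespace with made-up values stands in for a real tweepy status object so the example runs without credentials.

```python
from datetime import datetime
from types import SimpleNamespace

# Column names taken from the description of raw_df() above.
COLUMNS = [
    "id", "created_at", "full_text", "favorite_count", "retweet_count",
    "retweeted", "entities", "in_reply_to_user_id", "source",
]


def statuses_to_rows(tweets):
    """Hypothetical helper: build one dict per status-like object."""
    return [{col: getattr(tweet, col, None) for col in COLUMNS} for tweet in tweets]


# Made-up stand-in for a single tweepy status (assumption, not real data).
fake_status = SimpleNamespace(
    id=1,
    created_at=datetime(2021, 3, 1, 9, 30),
    full_text="Hello #python",
    favorite_count=10,
    retweet_count=2,
    retweeted=False,
    entities={"hashtags": [{"text": "python"}]},
    in_reply_to_user_id=None,
    source="Twitter Web App",
)

rows = statuses_to_rows([fake_status])
print(rows[0]["full_text"])  # Hello #python
```

A list of dictionaries like rows can be passed directly to pandas.DataFrame(rows) to obtain one row per tweet.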

clean_tweets(tweets, handle = "", text_only = True, word_count = True, emojis = True, hashtags = True, sentiment = True, flesch_readability = True, proportion_of_avg_retweets = True, proportion_of_avg_hearts = True): Creates new columns based on the data in the pandas.DataFrame generated by raw_df() and returns a new dataframe. It can generate the following columns:

  • handle: Adds a column containing the specified Twitter handle.

  • text_only: Adds a column of the tweet text containing no emojis, links, hashtags, or mentions.

  • word_count: Adds a column of the number of words in the text_only column.

  • emojis: Adds a column of the emojis extracted from the tweet text.

  • hashtags: Adds a column of the hashtags extracted from the tweet text.

  • sentiment: Adds a column containing the nltk.sentiment.vader SentimentIntensityAnalyzer sentiment score for each tweet.

  • flesch_readability: Adds a column containing the textstat Flesch readability score (default is True).

  • proportion_of_avg_retweets: Adds a column containing the number of retweets a tweet received as a proportion of the account average.

  • proportion_of_avg_hearts: Adds a column containing the number of hearts a tweet received as a proportion of the account average.
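To make a few of these columns concrete, here is a minimal standard-library sketch of how text_only, hashtags, word_count, and the proportion-of-average columns could be derived. It is an assumption-laden illustration, not the code used by clean_tweets(): the regular expressions and the helper names split_tweet and proportion_of_average are hypothetical, and emoji handling is left out because it relies on the emoji package.

```python
import re

# Simplified patterns (assumptions); real tweets may need more careful rules.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@\w+")


def split_tweet(text):
    """Hypothetical helper: derive text_only, hashtags and word_count."""
    without_urls = URL_RE.sub(" ", text)          # drop links first
    hashtags = HASHTAG_RE.findall(without_urls)   # collect hashtag names
    stripped = HASHTAG_RE.sub(" ", without_urls)  # then remove hashtags
    stripped = MENTION_RE.sub(" ", stripped)      # and mentions
    text_only = " ".join(stripped.split())        # normalise whitespace
    return {
        "text_only": text_only,
        "hashtags": hashtags,
        "word_count": len(text_only.split()),
    }


def proportion_of_average(counts):
    """Hypothetical helper: each count divided by the mean of all counts."""
    if not counts:
        return []
    average = sum(counts) / len(counts)
    if average == 0:
        return [0.0 for _ in counts]
    return [count / average for count in counts]


example = split_tweet("Loving #python today @guido https://example.com")
print(example)
print(proportion_of_average([10, 20, 30]))  # [0.5, 1.0, 1.5]
```

A value above 1 means the tweet received more retweets (or hearts) than the account's average tweet, and a value below 1 means it received fewer.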

engagement_by_hour(tweets): Creates an Altair line chart of the total number of likes and retweets received by hour of posting.
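The aggregation behind such a chart can be sketched with the standard library alone. The snippet below only computes the hourly totals and does not draw anything; totals_by_hour and the sample tuples are hypothetical and use made-up numbers, so this should be read as an illustration of the grouping step rather than the package's own code.

```python
from collections import defaultdict
from datetime import datetime


def totals_by_hour(tweets):
    """Hypothetical helper: sum likes and retweets per hour of posting.

    `tweets` is an iterable of (created_at, likes, retweets) tuples.
    """
    totals = defaultdict(lambda: {"likes": 0, "retweets": 0})
    for created_at, likes, retweets in tweets:
        bucket = totals[created_at.hour]
        bucket["likes"] += likes
        bucket["retweets"] += retweets
    return dict(sorted(totals.items()))


# Made-up sample data: two tweets posted in the 9 o'clock hour, one at 18.
sample = [
    (datetime(2021, 3, 1, 9, 5), 10, 2),
    (datetime(2021, 3, 2, 9, 40), 4, 1),
    (datetime(2021, 3, 2, 18, 0), 7, 3),
]

hourly = totals_by_hour(sample)
print(hourly)  # {9: {'likes': 14, 'retweets': 3}, 18: {'likes': 7, 'retweets': 3}}
```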

tweet_words(clean_dataframe, top_n): Returns a pandas.DataFrame of the top_n most common words and their counts across the tweets in the cleaned dataframe.
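The counting step can be illustrated with collections.Counter. This is a simplified sketch under stated assumptions: top_words is a hypothetical helper, it splits on whitespace and lowercases words, and it does not filter stop words or return a DataFrame, so it may differ from what tweet_words() actually does.

```python
from collections import Counter


def top_words(texts, top_n):
    """Hypothetical helper: the top_n most frequent words across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(top_n)


# Made-up, already-cleaned tweet texts (like a text_only column).
texts = ["good morning world", "good coffee", "morning good"]
common = top_words(texts, 2)
print(common)  # [('good', 3), ('morning', 2)]
```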

sentiment_total(data, lexicon): Takes unaggregated tweet data and summarizes the number of tweeted words associated with particular emotional sentiments. Returns an Altair line chart.
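As a hedged sketch of the summarizing step, the example below counts how many tweeted words are associated with each emotion. The lexicon format expected by sentiment_total() is not described here, so TOY_LEXICON (a word-to-emotions mapping with invented entries) and the helper emotion_word_counts are purely illustrative assumptions.

```python
from collections import Counter

# Invented toy lexicon (assumption): each word maps to a list of emotions.
TOY_LEXICON = {
    "happy": ["joy"],
    "win": ["joy", "anticipation"],
    "afraid": ["fear"],
}


def emotion_word_counts(texts, lexicon):
    """Hypothetical helper: count words associated with each emotion."""
    totals = Counter()
    for text in texts:
        for word in text.lower().split():
            # Words missing from the lexicon contribute nothing.
            totals.update(lexicon.get(word, []))
    return dict(totals)


counts = emotion_word_counts(["Happy to win", "afraid but happy"], TOY_LEXICON)
print(counts)  # {'joy': 3, 'anticipation': 1, 'fear': 1}
```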

tweepyclean's Place in the Python Ecosystem

To our knowledge, tweepyclean is the first package to provide this functionality. Working with tweepy data has typically required extensive processing to produce a clean dataframe with useful features. tweepyclean makes it straightforward to extract the data contained in the features that tweepy already retrieves, and lets users optionally apply statistical analysis and language processing tools (such as sentiment analysis) to that data. It also offers streamlined summary methods that quickly produce figures and tables describing your tweepy data, so users can easily understand and analyze a Twitter user's timeline. Specifically, examining an account's engagement, most common words, and emotional sentiment can each be done with a single function.

Documentation

The official documentation is hosted on Read the Docs: https://tweepyclean.readthedocs.io/en/latest/

Contributors

We welcome and recognize all contributions. Please see contributing guidelines in the Contributing document. This repository is currently maintained by @nashmakh, @calsvein, @MattTPin, @syadk.

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci and audreyr/cookiecutter-pypackage project templates.