Given filter words, this project fetches tweets and shows polarity and sentiment scores, topic modeling results, and a word cloud based on the fetched tweets.
Twitter is a social media platform on which an enormous amount of data is generated. Using Twitter data, it is possible to run various analyses on a particular product or entity. In this report, we will see step by step how to carry out a Twitter data analysis.
The first step should be to understand your problem, what data it requires, and where you can get that data. In this case, we can extract Twitter data using the API keys that Twitter provides upon request. The data arrives in JSON format, which is difficult for humans to read.
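As a hedged illustration, the snippet below pulls matching tweets into raw JSON with tweepy; the credential variable names, the search query, and the output path are assumptions, not taken from the project.

```python
import json
import os

import tweepy

# Authenticate with the keys Twitter issues on request (env var names assumed).
auth = tweepy.OAuthHandler(os.environ["API_KEY"], os.environ["API_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"], os.environ["ACCESS_SECRET"])
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch tweets containing a filter word and keep the raw JSON payloads.
tweets = [
    status._json
    for status in tweepy.Cursor(
        api.search_tweets, q="some product", tweet_mode="extended"
    ).items(200)
]

with open("data/tweets.json", "w") as f:
    json.dump(tweets, f)
```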
We should first convert the JSON file to a DataFrame using the pandas Python library, and then save it as a CSV file for further use. Typecasting should be performed to get a suitable and sensible data format, and special characters, emojis, and unwanted content should be removed from the DataFrame.
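A minimal sketch of that conversion and cleaning step follows; the file paths, column names, and cleaning regexes are illustrative assumptions rather than the project's actual code.

```python
import json
import re

import pandas as pd

with open("data/tweets.json") as f:
    raw = json.load(f)

# Flatten the nested JSON into tabular columns and save a CSV copy.
df = pd.json_normalize(raw)
df.to_csv("data/tweets.csv", index=False)

# Typecast to sensible dtypes.
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

def clean_text(text: str) -> str:
    """Strip URLs, then remaining special characters and emojis."""
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[^A-Za-z0-9#@\s]", "", text)
    return text.strip()

df["full_text"] = df["full_text"].astype(str).apply(clean_text)
```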
Missing and None values should also be handled. In some cases we can fill a missing value with a reasonable default, but where filling makes no sense, the affected rows should be dropped.
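For example, continuing the sketch above (column names still assumed):

```python
# Fill where a sensible default exists; otherwise drop.
df["place"] = df["place"].fillna("unknown")        # a default is reasonable here
df = df.dropna(subset=["full_text"])               # a tweet without text is useless
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))  # drop mostly-empty columns
```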
handled by extract_dataframe.py, clean_tweets_dataframe.py, and preprocess_tweets_data.py
After a clean DataFrame is generated, we should carry out exploratory data analysis (EDA) to gain insight from the data; this insight will help us achieve our objective. In this step we explore the statistical relationships between attributes. For example, we can extract the most common user mentions, the most frequent hashtags, the number of positive and negative sentiments, and so on.
handled by JupyterNotebook/EDA.ipynb
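A few of those EDA queries might look like the sketch below, assuming the cleaned CSV from earlier and a precomputed polarity column; none of these names are confirmed by the notebook.

```python
import pandas as pd

df = pd.read_csv("data/tweets.csv")

# Most common hashtags and user mentions.
hashtags = df["full_text"].str.findall(r"#\w+").explode()
mentions = df["full_text"].str.findall(r"@\w+").explode()
print(hashtags.value_counts().head(10))
print(mentions.value_counts().head(10))

# Number of positive vs. negative tweets, assuming a polarity score column.
sentiment = df["polarity"].apply(lambda p: "positive" if p > 0 else "negative")
print(sentiment.value_counts())
```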
The next step will be modeling: developing a system that can solve the challenge we are facing. Our objective is to perform sentiment analysis and topic modeling. For the first task, we develop a classification algorithm; for the second, we use the unsupervised LDA (Latent Dirichlet Allocation) model.
handled by SentimentalAnalysis.ipynb and TopicModeling.ipynb in the JupyterNotebooks
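As a rough sketch of both tasks, the snippet below labels sentiment from TextBlob polarity scores and fits a gensim LDA model; neither library choice nor the preprocessing is confirmed by the notebooks.

```python
import gensim
from gensim import corpora
from textblob import TextBlob

texts = ["the product is great", "terrible support, never buying again"]

# Sentiment: derive a label from each tweet's polarity score.
labels = [
    "positive" if TextBlob(t).sentiment.polarity > 0 else "negative"
    for t in texts
]

# Topic modeling: feed a bag-of-words corpus to an unsupervised LDA model.
tokens = [t.split() for t in texts]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(tok) for tok in tokens]
lda = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=5)
print(lda.print_topics())
```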
SQLAlchemy is used with pandas to provide a higher-level interface to the database. All database-related functionality is handled by the mysql_manager.py file.
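The pandas/SQLAlchemy pairing typically looks like the sketch below; the connection string and table name are assumptions (the project's actual settings live in mysql_manager.py).

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; real credentials belong in mysql_manager.py.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/tweets_db")

df = pd.read_csv("data/tweets.csv")
df.to_sql("tweets", con=engine, if_exists="replace", index=False)  # write
sample = pd.read_sql("SELECT * FROM tweets LIMIT 5", con=engine)   # read back
```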
For this part I used Streamlit to show the different findings I got from the EDA notebook. In addition, there are word clouds generated from hashtags, user mentions, and tweet texts.
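A minimal version of that dashboard, assuming the streamlit and wordcloud packages (run with `streamlit run dashboard.py`), could look like:

```python
import pandas as pd
import streamlit as st
from wordcloud import WordCloud

df = pd.read_csv("data/tweets.csv")

st.title("Twitter Data Analysis")

# Build one word cloud from the tweet texts; hashtags and mentions
# would follow the same pattern on their own columns.
text = " ".join(df["full_text"].astype(str))
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
st.image(wc.to_array())
```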
The most important takeaway from the week0 challenge is the MLOps pipeline. MLOps can help automate the steps from data engineering to the model deployment phase. First, the data and the features generated by the data engineering phase are stored in a SQL/NoSQL database. We should also register the model parameters and performance. During deployment, we fetch these values from the database and use them. This also helps with versioning the data and the model: if there is data drift, model performance decay, or a requirement change, we can raise an alert, retrain the model, and run the process from the start again.
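An illustrative registration step, reusing the SQLAlchemy setup above, might look like this; the table schema and metric values are placeholders, not the project's actual registry.

```python
import datetime

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost:3306/tweets_db")

# Record the model version, parameters, and performance so deployment can
# fetch them later and drift alerts can compare against them.
record = pd.DataFrame([{
    "model": "sentiment_classifier",  # hypothetical name
    "version": "v0.1",
    "params": '{"num_topics": 2, "passes": 5}',
    "accuracy": 0.87,                 # placeholder metric
    "trained_at": datetime.datetime.utcnow().isoformat(),
}])
record.to_sql("model_registry", con=engine, if_exists="append", index=False)
```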
- More test coverage
- Add logging
- More and better exception handling
- More data analysis and modeling
- Add Model/Data drift detection
- Integrate the model into the dashboard