Portfolio of ongoing and completed data science projects that I built for the Practicum Data Science bootcamp, as well as for academic, self-learning, and hobby purposes.
Following are the highlights of the projects:
- Languages/Software: Python, PostgreSQL, HTML, CSS, SPSS Statistics, Microsoft Excel/Access, Qualtrics Surveys, Google Analytics, Adobe Analytics
- Tools: pandas, NumPy, seaborn, matplotlib, plotly, SciPy, scikit-learn, TensorFlow, Keras, geopandas, folium, langid, Beautiful Soup, Selenium, transformers, NLP (NLTK, spaCy, BERT), librosa, spotipy, PySpark, Electron, shapely, SSH, SFTP, Unix/Linux
- Machine Learning Models Evaluated:
- Classification Models: DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier, LogisticRegression, AdaBoostClassifier, XGBClassifier, CatBoostClassifier, LGBMClassifier
- Regression Models: LinearRegression, DecisionTreeRegressor, RandomForestRegressor, XGBRegressor, LGBMRegressor, CatBoostRegressor
- Psychological Assessment Scales and Measures:
- Multigroup Ethnic Identity Measure (MEIM)
- State-Trait Anxiety Inventory (STAI)
Contributed research to Yachay's infrastructure, with the goal of training a BERT-based deep learning model to predict user geolocation from individual tweets. Yachay is an open-source machine learning community that has collected decades' worth of natural language data from various sources.
Highlights:
- Identified and gathered relevant data from various sources.
- Performed exploratory data analysis to gain insights into the data.
- Used Hugging Face NLP pipelines to extract text features, including sentiment, topic, and language.
- Preprocessed the text data (e.g., tokenization, stop-word removal) for BERT model training.
- Evaluated model performance using MSE and haversine loss.
- Median and mean errors (distance between predicted and actual locations) were 1,334 km and 1,881 km, respectively.
- Identified the workflow's potential for scaling and integration into the existing infrastructure for real-time geolocation prediction.
Tools: Python, seaborn, folium, scikit-learn, langid, geopandas, TensorFlow, BERT
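The distance-based evaluation mentioned in the highlights above can be illustrated with a minimal sketch. It is not the project's actual code: the function names `haversine_km` and `error_summary`, the Earth radius constant, and the input format (lists of `(latitude, longitude)` pairs in degrees) are all assumptions made for illustration.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def error_summary(predicted, actual):
    """Return (median, mean) haversine error in km over paired coordinates."""
    errors = [
        haversine_km(p[0], p[1], a[0], a[1])
        for p, a in zip(predicted, actual)
    ]
    return median(errors), mean(errors)
```

Reporting both the median and the mean is useful because a few predictions that land on the wrong continent inflate the mean much more than the median.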
Tasked with developing a Python-based regression model to predict the valence of pop songs. An automatic method of estimating valence is useful for playlist curation and other applications.
Highlights:
- Data collection and extraction using Spotify’s Web API
- Audio feature extraction and analysis from mp3 files using librosa (Python package for music analysis)
- Regression analysis to predict valence using both song lyrics (NLP) and audio features as input
- Implemented various approaches to train and validate models to forecast valence scores of songs
- Conducted model training, validation, and hyperparameter optimization using RandomizedSearchCV on four regression models: Random Forest, K-Nearest Neighbors, XGBoost, and Linear Regression
- Selected a Support Vector Regression model trained on 191 normalized audio features, achieving an RMSE of 0.16 on test data and meeting the company's desired model performance standards
Tools: pandas, NumPy, matplotlib, seaborn, spotipy, transformers, scikit-learn, librosa
Researching grant prospects can be time-consuming and overwhelming. The goal is to develop an automation system for a nonprofit organization (DataReady DFW) that finds available grant opportunities and fills out applications with little or no human intervention.
Highlights:
- Database management; collected, analyzed, and interpreted raw public grant data
- Developed and implemented scripts to autofill constant values
- Collection, reporting, and analysis of website data
- Trained a script to filter and select further grant opportunities using natural language processing
Tools: TensorFlow, Beautiful Soup, Selenium, pandas, NLTK, Google Analytics
Hands-on Experience | Project | Technical skills |
---|---|---|
Unsupervised Learning | Client Churn Forecast | Machine learning algorithms, XGBoost, CatBoost, LightGBM |
Computer Vision (CV) | Computer Vision Age Detection w/ deep learning | TensorFlow |
Natural Language Processing (NLP) | IMDB Movie Sentiment Analysis using NLP | SGDClassifier, Naive Bayes, LightGBM, spaCy, TF-IDF, BERT |
Time Series Analysis | Time Series Forecast | Time Series Analysis, CatBoost, LightGBM, XGBoost |
Machine Learning in Business | Gold Recovery Regression Model | Python, Scikit-learn, LinearRegression |
Numerical Methods with ML | Vehicle Market Value Prediction w/ gradient boosting | Numerical Methods, CatBoost, LightGBM, XGBoost |
Linear Algebra with Machine Learning | Insurance Benefits Predictive Model | Scikit-learn, Linear Algebra, k-Nearest Neighbors |
Machine Learning - Classification | Telecom Plan Classification Model | Python, Scikit-learn, Pandas |
Supervised Learning - Prediction | Bank Customer Churn Prediction | Scikit-learn, XGBoost, GridSearchCV, AdaBoost |
Machine Learning in Business | Oil Well Regression Model | Python, Scikit-learn, Bootstrapping, LinearRegression |
Webscraping and Data Storage | Data Collection, Webscraping and Storage | PostgreSQL, Python, BeautifulSoup, Seaborn, Matplotlib, ETL (extract, transform and load) |
Data Visualization and Storytelling with Data | Video Game Market Analysis | Python, Pandas, Squarify, Seaborn, Matplotlib |
Data Preprocessing | Credit Score Analysis | Python, NLTK, WordNetLemmatizer, SnowballStemmer, Seaborn, Matplotlib |
Exploratory Data Analysis (EDA) | Vehicle Market Analysis | Pandas, Matplotlib |
Statistical Data Analysis (SDA) | Telecom Customer Data Analysis | Python, pandas, NumPy, SciPy, Seaborn, Matplotlib |
- Client Churn Forecast w/ Machine Learning: Forecast the churn of clients for a telecom company. Collect the necessary information to assist the marketing department in figuring out different ways of retaining clients.
- Computer Vision Age Detection: Build a computer vision regression model to predict the approximate age of a customer from a supermarket checkout photograph.
- IMDB Movie Sentiment Analysis using NLP: Build a machine learning model to automatically detect negative reviews for a system used to filter and categorize movie reviews.
- Time Series Forecast: Use historical data on taxi orders at airports to create a model that predicts the number of taxi orders for the next hour.
- Vehicle Market Value Prediction: Determine the market value of a used car using historical car data. Identify the quality and speed of prediction for various models.
- Insurance Benefits Predictive Model: Linear regression for insurance benefits prediction
- Gold Recovery Regression Model: Prepare a prototype of a machine learning model for Zyfra to predict the amount of gold recovered from the gold ore for the purpose of optimizing production and eliminating unprofitable parameters. The company develops efficiency solutions for the heavy industry.
- Telecom Plan Classification Model: Build a machine learning model to identify the right plan for each subscriber based on their behavior, using the historical data available.
- Bank Customer Churn Prediction Model: Creating a classification model to predict customer churn for a bank from an imbalanced dataset.
- Oil Well Regression Model: Build a Linear Regression model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.
Tools: sklearn, Pandas, Seaborn, Matplotlib, TensorFlow, PIL (Python Imaging Library)
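As an illustration of the bootstrapping technique mentioned for the Oil Well Regression Model, here is a minimal sketch using only the standard library. It is not the project's actual code: the function name `bootstrap_profit`, its parameters, the 95% interval, and the idea of passing a flat list of per-sample profits are assumptions chosen to keep the example small.

```python
import random
from statistics import mean


def bootstrap_profit(profits, n_resamples=1000, seed=0):
    """Estimate mean profit, a 95% interval, and loss risk by bootstrapping.

    `profits` is a hypothetical list of profit values, one per sample.
    """
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(profits, k=len(profits)))
        for _ in range(n_resamples)
    )
    # Percentile-style 95% interval taken from the sorted resample means.
    lower = means[int(0.025 * n_resamples)]
    upper = means[max(0, int(0.975 * n_resamples) - 1)]
    # Risk of loss: share of resamples whose mean profit is negative.
    loss_risk = sum(m < 0 for m in means) / n_resamples
    return mean(means), (lower, upper), loss_risk
```

Comparing the returned interval and loss risk across regions is one way to rank them by expected profit while keeping the risk visible.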
- Video Game Market Analysis: Data Analysis identifying what factors make a video game succeed. Identify patterns in historical game sales data, analyze metrics for each video game platform, and conduct statistical hypothesis testing to find potential big winners and plan advertising campaigns.
- Telecom Plan Analysis: Preliminary analysis of the plans based on a relatively small client selection. Analyze clients' behavior and determine which prepaid plan brings in more revenue. Conduct statistical hypothesis testing on profit from different plan users and different regions.
- Vehicle Sales Analysis: Analysis on what factors affect the price of a vehicle to be listed on a car sales website.
Tools: Pandas, Seaborn and Matplotlib, SciPy
- Ride Sharing App Analysis: Data analysis of Chicago taxicab rides and weather reports to advise the hypothetical ride-sharing company Zuber. Study a database, analyze data from competitors, and test a hypothesis about the impact of weather on ride frequency.
Tools: Beautiful Soup, Requests, Pandas, PostgreSQL, scipy.stats, NumPy
- Credit Score Analysis: Analyze borrowers’ risk of defaulting. Prepare a report for a bank’s loan division to determine the likelihood that a customer defaults on a loan. Find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan.
Tools: Python, NLTK, WordNetLemmatizer, SnowballStemmer
- Ethnic-Identity-Development-Research-Project: Research on ethnic identity development, examining how three components of ethnic identity (ethnic search, achievement, and commitment) change across the transition from high school to college.
- Effect of Anxiety on Self Estimates of Intelligence: Determine whether higher levels of state anxiety lead to lower self-estimates of intelligence among undergraduate students.
- Advanced Psych Research Methods Research Project: Experimental analysis examining the relationship between phone use and amount of sleep.
Tools: SPSS Statistics
Psychological Assessment Scales and Measures:
- Multigroup Ethnic Identity Measure (MEIM)
- State-Trait Anxiety Inventory (STAI)