Data Science & Analytics Project Portfolio

This repository contains a portfolio of ongoing and completed data science projects that I built for the Practicum Data Science bootcamp and for academic, self-learning, and hobby purposes.

Following are the highlights of the projects:

Languages/Software: Python, PostgreSQL, HTML, CSS, SPSS Statistics, Microsoft Excel/Access, Qualtrics Surveys, Google Analytics, Adobe Analytics
Tools: pandas, NumPy, seaborn, matplotlib, plotly, SciPy, scikit-learn, TensorFlow, Keras, geopandas, folium, langid, Beautiful Soup, Selenium, transformers, NLP (NLTK, spaCy, BERT), librosa, spotipy, PySpark, Electron, shapely, SSH, SFTP, Unix/Linux

Machine Learning Models Evaluated:
- Classification Models: DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier, AdaBoostClassifier, XGBClassifier, CatBoostClassifier, LGBMClassifier, LogisticRegression
- Regression Models: LinearRegression, DecisionTreeRegressor, RandomForestRegressor, XGBRegressor, LGBMRegressor, CatBoostRegressor

Psychological Assessment Scales and Measures:
- Multigroup Ethnic Identity Measure (MEIM)
- State-Trait Anxiety Inventory (STAI)

Relevant Experience:

Tweet Geolocation Prediction - Yachay.ai | Junior NLP Engineer

Contributed research to the company's infrastructure, with the goal of training a BERT-based deep learning model to predict user geolocation from individual tweets. Yachay is an open-source machine learning community that has collected decades' worth of natural language data from various sources.

Highlights:

- Identified and gathered relevant data from various sources.
- Performed exploratory data analysis to gain insights into the data.
- Used Hugging Face NLP pipelines to extract text features, including sentiment, topic, and language.
- Preprocessed the text data (e.g., tokenization, stop-word removal) for training the BERT-based model.
- Evaluated model performance using MSE and haversine loss.
- Measured the model's accuracy as the distance between predicted and actual locations: the median and mean errors were 1,334 km and 1,881 km, respectively.
- Noted the workflow's potential to provide geolocation prediction, with the possibility of scaling it and integrating it into the existing infrastructure for real-time use.

Tools: Python, seaborn, folium, scikit-learn, langid, geopandas, TensorFlow, BERT
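
The haversine loss named in the highlights measures the great-circle distance between a predicted and an actual coordinate. The following is a minimal sketch of that metric and of a median/mean error summary, not the project's actual code; the function names, the Earth-radius constant, and the sample coordinates are assumptions made for illustration.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def error_summary(predicted, actual):
    """Return (median, mean) haversine error in km over paired coordinates."""
    errors = [haversine_km(*p, *a) for p, a in zip(predicted, actual)]
    return statistics.median(errors), statistics.fmean(errors)


# Hypothetical predicted vs. actual (lat, lon) pairs, for illustration only.
predicted = [(40.0, -75.0), (34.0, -118.0), (51.5, 0.0)]
actual = [(40.7, -74.0), (37.8, -122.4), (48.9, 2.4)]
median_km, mean_km = error_summary(predicted, actual)
print(f"median error: {median_km:.0f} km, mean error: {mean_km:.0f} km")
```

Reporting the median alongside the mean is useful here because a few predictions on the wrong continent can inflate the mean while leaving the median largely unchanged.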

Song Valence Prediction - Cuetessa, Inc. | Junior Data Scientist

Tasked with developing a Python-based regression model to predict the valence of pop songs. An automatic method of estimating valence is useful for playlist curation and other applications.

Highlights:

- Collected and extracted data using Spotify’s Web API.
- Extracted and analyzed audio features from MP3 files using librosa (a Python package for music analysis).
- Ran regression analysis to predict valence using both song lyrics (NLP) and audio features as input.
- Implemented various approaches to train and validate models that predict songs' valence scores.
- Conducted model training, validation, and hyperparameter optimization using RandomizedSearchCV on four regression models: Random Forest, K-Nearest Neighbors, XGBoost, and Linear Regression.
- Selected a Support Vector Regression model trained on 191 normalized audio features, achieving an RMSE of 0.16 on the test data and meeting the company's desired model performance standards.

Tools: pandas, NumPy, matplotlib, seaborn, spotipy, transformers, scikit-learn, librosa
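
The hyperparameter search and RMSE evaluation described in the highlights can be sketched with scikit-learn's RandomizedSearchCV. This is an illustrative sketch on synthetic data, not the project's pipeline: the feature matrix, the target formula, the parameter ranges, and the choice to standardize features inside a pipeline around SVR are all assumptions.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for audio features and valence scores in (0, 1) (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling inside the pipeline is refit on each training fold, so
# validation-fold statistics never leak into training.
model = make_pipeline(StandardScaler(), SVR())

# Hypothetical search space; the source does not list the ranges used.
param_distributions = {
    "svr__C": [0.1, 1.0, 10.0, 100.0],
    "svr__epsilon": [0.01, 0.05, 0.1],
    "svr__gamma": ["scale", "auto"],
}

search = RandomizedSearchCV(
    model,
    param_distributions,
    n_iter=10,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refit best estimator on the held-out test split.
predictions = search.predict(X_test)
rmse = float(np.sqrt(np.mean((y_test - predictions) ** 2)))
print(f"best params: {search.best_params_}, test RMSE: {rmse:.3f}")
```

Because valence lies between 0 and 1, an RMSE is read on that same scale, which is how a figure such as the reported 0.16 should be interpreted.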

Grant Automated Web Scraper - DataReadyDFW | Data/Web Analytics Intern

Researching grant prospects can be time-consuming and overwhelming. Developed an automation system for a nonprofit organization (DataReady DFW) to find available grant opportunities and fill out applications with little or no human intervention.

Highlights:

- Managed the database; collected, analyzed, and interpreted raw public grant data.
- Developed and implemented scripts to autofill constant values.
- Collected, reported on, and analyzed website data.
- Trained a script to filter and select further grant opportunities using natural language processing.

Tools: TensorFlow, Beautiful Soup, Selenium, pandas, NLTK, Google Analytics

Technical Projects:

Hands-on Experience	Project	Technical skills
Unsupervised Learning	Client Churn Forecast	Machine learning algorithms, XGBoost, CatBoost, LightGBM
Computer Vision (CV)	Computer Vision Age Detection w/ deep learning	TensorFlow
Natural Language Processing (NLP)	IMDB Movie Sentiment Analysis using NLP	SGDClassifier, Naive Bayes, LightGBM, spaCy, TF-IDF, BERT
Time Series Analysis	Time Series Forecast	Time Series Analysis, CatBoost, LightGBM, XGBoost
Machine Learning in Business	Gold Recovery Regression Model	Python, Scikit-learn, LinearRegression
Numerical Methods with ML	Vehicle Market Value Prediction w/ gradient boosting	Numerical Methods, CatBoost, LightGBM, XGBoost
Linear Algebra with Machine Learning	Insurance Benefits Predictive Model	Scikit-learn, Linear Algebra, k-Nearest Neighbors
Machine Learning - Classification	Telecom Plan Classification Model	Python, Scikit-learn, Pandas
Supervised Learning - Prediction	Bank Customer Churn Prediction	Scikit-learn, XGBoost, GridSearchCV, AdaBoost
Machine Learning in Business	Oil Well Regression Model	Python, Scikit-learn, Bootstrapping, LinearRegression
Webscraping and Data Storage	Data Collection, Webscraping and Storage	PostgreSQL, Python, BeautifulSoup, Seaborn, Matplotlib, ETL (extract, transform and load)
Data Visualization and Storytelling with Data	Video Game Market Analysis	Python, Pandas, Squarify, Seaborn, Matplotlib
Data Preprocessing	Credit Score Analysis	Python, NLTK, WordNetLemmatizer, SnowballStemmer, Seaborn, Matplotlib
Exploratory Data Analysis (EDA)	Vehicle Market Analysis	Pandas, Matplotlib
Statistical Data Analysis (SDA)	Telecom Customer Data Analysis	Python, pandas, Numpy, SciPy, Seaborn, Matplotlib

Machine Learning (Python)
- Client Churn Forecast w/ Machine Learning: Forecast the churn of clients for a telecom company. Collect the necessary information to assist the marketing department in figuring out different ways of retaining clients.
- Computer Vision Age Detection: Build a computer vision regression model to predict the approximate age of a customer from a supermarket checkout photograph.
- IMDB Movie Sentiment Analysis using NLP: Build a machine learning model to automatically detect negative reviews for a system used to filter and categorize movie reviews.
- Time Series Forecast: Use historical data on taxi orders at airports to create a model that predicts the number of taxi orders for the next hour.
- Vehicle Market Value Prediction: Determine the market value of a used car using historical car data. Identify the quality and speed of prediction for various models.
- Insurance Benefits Predictive Model: Use linear regression to predict insurance benefits.
- Gold Recovery Regression Model: Prepare a prototype of a machine learning model for Zyfra to predict the amount of gold recovered from the gold ore for the purpose of optimizing production and eliminating unprofitable parameters. The company develops efficiency solutions for the heavy industry.
- Telecom Plan Classification Model: Build a machine learning model to identify the right plan for each subscriber based on their behavior, using the historical data available.
- Bank Customer Churn Prediction Model: Create a classification model to predict customer churn for a bank from an imbalanced dataset.
- Oil Well Regression Model: Build a Linear Regression model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.

Tools: sklearn, Pandas, Seaborn, Matplotlib, TensorFlow, PIL (Python Imaging Library)
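
For the Oil Well Regression Model, the bootstrapping step used to analyze profit and risk can be illustrated as follows. This is a sketch under stated assumptions, not the project's code: the function name, the per-well profit figures, the number of resamples, and the 95% interval are hypothetical.

```python
import random
import statistics


def bootstrap_profit(well_profits, n_resamples=1000, sample_size=None, seed=0):
    """Resample per-well profits with replacement and summarise total profit."""
    rng = random.Random(seed)
    k = sample_size or len(well_profits)
    # Each resample draws k wells with replacement and sums their profit.
    totals = sorted(
        sum(rng.choices(well_profits, k=k)) for _ in range(n_resamples)
    )
    # Empirical 2.5th and 97.5th percentiles of the resampled totals.
    lower = totals[int(0.025 * n_resamples)]
    upper = totals[int(0.975 * n_resamples) - 1]
    # Risk of loss: share of resamples whose total profit is negative.
    loss_risk = sum(total < 0 for total in totals) / n_resamples
    return {
        "mean_profit": statistics.fmean(totals),
        "ci95": (lower, upper),
        "loss_risk": loss_risk,
    }


# Hypothetical per-well profits (arbitrary units); negatives are loss-making wells.
region_profits = [12.0, -3.5, 8.2, 15.1, -1.0, 4.4, 9.9, -6.3, 11.7, 2.8]
summary = bootstrap_profit(region_profits)
print(summary)
```

Running the same function once per region gives a mean profit, an interval, and a loss risk for each, which is one way to compare regions on both expected return and downside.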

Statistical Data Analysis and Visualisation (Python)
- Video Game Market Analysis: Data Analysis identifying what factors make a video game succeed. Identify patterns in historical game sales data, analyze metrics for each video game platform, and conduct statistical hypothesis testing to find potential big winners and plan advertising campaigns.
- Telecom Plan Analysis: Preliminary analysis of the plans based on a relatively small client selection. Analyze clients' behavior and determine which prepaid plan brings in more revenue. Conduct statistical hypothesis testing on profit from different plan users and different regions.
- Vehicle Sales Analysis: Analysis on what factors affect the price of a vehicle to be listed on a car sales website.

Tools: Pandas, Seaborn and Matplotlib, SciPy
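
The hypothesis tests mentioned for the Telecom Plan Analysis compare average revenue between groups of users. One way to sketch such a comparison with SciPy is Welch's two-sample t-test; the choice of test, the significance level, the function name, and the revenue figures below are assumptions for illustration, not the analysis actually performed.

```python
from scipy import stats


def compare_mean_revenue(revenue_a, revenue_b, alpha=0.05):
    """Welch's two-sample t-test: do two groups differ in mean revenue?"""
    # equal_var=False selects Welch's test, which does not assume equal variances.
    result = stats.ttest_ind(revenue_a, revenue_b, equal_var=False)
    return {
        "t_statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "reject_null": bool(result.pvalue < alpha),
    }


# Hypothetical monthly revenue per client for two plans.
plan_a = [20.0, 22.5, 20.0, 35.0, 20.0, 27.5, 20.0, 31.0]
plan_b = [70.0, 70.0, 74.0, 70.0, 81.5, 70.0, 70.0, 77.0]
print(compare_mean_revenue(plan_a, plan_b))
```

The null hypothesis is that both groups have the same mean revenue; a p-value below the chosen alpha is taken as evidence against it. The same function applies to a comparison between regions.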

Data Collection and Storage (Python and PostgreSQL)
- Ride Sharing App Analysis: Data analysis of Chicago taxicab rides and weather reports to advise the hypothetical ride-sharing company Zuber. Study a database, analyze data from competitors, and test a hypothesis about the impact of weather on ride frequency.

Tools: Beautiful Soup, Requests, Pandas, PostgreSQL, scipy.stats, NumPy

Data Preprocessing (Python)
- Credit Score Analysis: Analyze borrowers’ risk of defaulting. Prepare a report for a bank’s loan division to determine the likelihood that a customer defaults on a loan. Find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan.

Tools: Python, NLTK, WordNetLemmatizer, SnowballStemmer

Undergraduate Research Projects (University at Buffalo)

Quantitative Research Analysis
- Ethnic-Identity-Development-Research-Project: Study of ethnic identity development that examines how three components of ethnic identity (ethnic search, achievement, and commitment) change across the transition from high school to college.
- Effect of Anxiety on Self Estimates of Intelligence: Determine whether higher levels of state anxiety lead to lower self-estimates of intelligence among undergraduate students.
- Advanced Psych Research Methods Research Project: Experimental analysis examining the relationship between phone use and the amount of sleep obtained.

Tools: SPSS Statistics


gguillau/Data-Science-Portfolio
