Repository containing portfolio of data analysis and machine learning projects

gguillau/Data-Science-Portfolio

Data Science & Analytics Project Portfolio


This repository contains a portfolio of ongoing and completed data science projects that I built for the Practicum Data Science bootcamp and for academic, self-learning, and hobby purposes.

Highlights of the projects:

  • Languages/Software: Python, PostgreSQL, HTML, CSS, SPSS Statistics, Microsoft Excel/Access, Qualtrics Surveys, Google Analytics, Adobe Analytics

  • Tools: pandas, NumPy, seaborn, matplotlib, plotly, SciPy, scikit-learn, TensorFlow, Keras, geopandas, folium, langid, Beautiful Soup, Selenium, transformers, NLP (NLTK, spaCy, BERT), librosa, spotipy, PySpark, Electron, shapely, SSH, SFTP, Unix/Linux

  • Machine Learning Models Evaluated:
    • Classification Models: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier, AdaBoostClassifier, XGBClassifier, CatBoostClassifier, LGBMClassifier
    • Regression Models: LinearRegression, DecisionTreeRegressor, RandomForestRegressor, XGBRegressor, LGBMRegressor, CatBoostRegressor
  • Psychological Assessment Scales and Measures:
    • Multigroup Ethnic Identity Measure (MEIM)
    • State-Trait Anxiety Inventory (STAI)

Relevant Experience:

Contributed research to the company's infrastructure, with the goal of training a BERT-based deep learning model to predict user geolocation from individual tweets. Yachay is an open-source machine learning community that has collected decades' worth of natural language data from various sources.

Highlights:

  • Identified and gathered relevant data from various sources.
  • Performed exploratory data analysis to gain insights into the data.
  • Used Hugging Face NLP pipelines to extract text features, including sentiment, topic, and language.
  • Preprocessed the text data (e.g., tokenization, stop-word removal) using BERT for model training.
  • Evaluated model performance using MSE and haversine loss.
  • The median and mean distances between predicted and actual locations were 1,334 km and 1,881 km, respectively, which quantifies the model's accuracy.
  • Identified the workflow's potential to provide geolocation prediction capabilities, with the possibility of scaling it and integrating it into the existing infrastructure for real-time use.

Tools: Python, seaborn, folium, scikit-learn, langid, geopandas, tensorflow, BERT
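
The distance-based evaluation described above can be sketched with the haversine formula. This is a minimal illustration rather than the project's actual code: the function names, the Earth radius constant, and the sample coordinates are all assumptions made for the example.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_errors(predicted, actual):
    """Return (median, mean) haversine error in km over paired coordinates."""
    errors = [haversine_km(*p, *a) for p, a in zip(predicted, actual)]
    return statistics.median(errors), statistics.mean(errors)


# Hypothetical predictions and ground-truth locations, for illustration only.
predicted = [(40.0, -75.0), (34.0, -118.0), (51.5, 0.0)]
actual = [(40.7128, -74.0060), (34.0522, -118.2437), (48.8566, 2.3522)]
median_km, mean_km = distance_errors(predicted, actual)
print(f"median error: {median_km:.0f} km, mean error: {mean_km:.0f} km")
```

Reporting both the median and the mean is useful here because a few very distant mispredictions can pull the mean well above the median.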


Tasked with developing a Python-based regression model to predict the valence of pop songs. An automatic method of estimating valence is useful for playlist curation and other applications.

Highlights:

  • Data collection and extraction using Spotify’s Web API
  • Audio feature extraction and analysis from mp3 files using librosa (a Python package for music analysis)
  • Regression analysis to predict valence using both song lyrics (NLP) and audio features as input
  • Implemented various approaches to train and validate models to forecast valence scores of songs
  • Conducted model training, validation, and hyperparameter optimization using RandomizedSearchCV on four regression models: Random Forest, K-Nearest Neighbors, XGBoost, and Linear Regression.
  • Selected a Support Vector Regression model trained on 191 normalized audio features, which achieved an RMSE of 0.16 on test data and met the company's desired performance standard.

Tools: pandas, NumPy, matplotlib, seaborn, spotipy, transformers, sklearn, librosa
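
The highlights mention normalized features and an RMSE score. The sketch below shows how both could be computed in plain Python. The README does not say which normalization was used, so min-max scaling is an assumption here, and the helper names and sample values are hypothetical.

```python
import math


def min_max_normalize(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical tempo values (BPM) and valence scores, for illustration only.
tempo = [60, 120, 180]
print(min_max_normalize(tempo))

valence_true = [0.2, 0.5, 0.9]
valence_pred = [0.3, 0.5, 0.7]
print(f"RMSE: {rmse(valence_true, valence_pred):.3f}")
```

Valence ranges from 0 to 1, so an RMSE is directly readable as a typical error on that scale. In a real pipeline the scaling bounds would be computed on the training split only and then reused on the test split.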


Researching grant prospects can be time-consuming and overwhelming. Developed an automation system for a nonprofit organization (DataReady DFW) to find available grant opportunities and fill out applications with little or no human intervention.

Highlights:

  • Database management; collected, analyzed, and interpreted raw public grant data
  • Developed and implemented scripts to autofill constant values
  • Collection, reporting, and analysis of website data
  • Trained a filtering script that uses natural language processing to select further grant opportunities

Tools: TensorFlow, Beautiful Soup, Selenium, pandas, NLTK, Google Analytics


Technical Projects:

| Hands-on Experience | Project | Technical skills |
| --- | --- | --- |
| Unsupervised Learning | Client Churn Forecast | Machine learning algorithms, XGBoost, CatBoost, LightGBM |
| Computer Vision (CV) | Computer Vision Age Detection w/ deep learning | TensorFlow |
| Natural Language Processing (NLP) | IMDB Movie Sentiment Analysis using NLP | SGDClassifier, Naive Bayes, LightGBM, spaCy, TF-IDF, BERT |
| Time Series Analysis | Time Series Forecast | Time Series Analysis, CatBoost, LightGBM, XGBoost |
| Machine Learning in Business | Gold Recovery Regression Model | Python, Scikit-learn, LinearRegression |
| Numerical Methods with ML | Vehicle Market Value Prediction w/ gradient boosting | Numerical Methods, CatBoost, LightGBM, XGBoost |
| Linear Algebra with Machine Learning | Insurance Benefits Predictive Model | Scikit-learn, Linear Algebra, k-Nearest Neighbors |
| Machine Learning - Classification | Telecom Plan Classification Model | Python, Scikit-learn, Pandas |
| Supervised Learning - Prediction | Bank Customer Churn Prediction | Scikit-learn, XGBoost, GridSearchCV, AdaBoost |
| Machine Learning in Business | Oil Well Regression Model | Python, Scikit-learn, Bootstrapping, LinearRegression |
| Webscraping and Data Storage | Data Collection, Webscraping and Storage | PostgreSQL, Python, BeautifulSoup, Seaborn, Matplotlib, ETL (extract, transform and load) |
| Data Visualization and Storytelling with Data | Video Game Market Analysis | Python, Pandas, Squarify, Seaborn, Matplotlib |
| Data Preprocessing | Credit Score Analysis | Python, NLTK, WordNetLemmatizer, SnowballStemmer, Seaborn, Matplotlib |
| Exploratory Data Analysis (EDA) | Vehicle Market Analysis | Pandas, Matplotlib |
| Statistical Data Analysis (SDA) | Telecom Customer Data Analysis | Python, pandas, NumPy, SciPy, Seaborn, Matplotlib |

  • Machine Learning (Python)

    • Client Churn Forecast w/ Machine Learning: Forecast the churn of clients for a telecom company. Collect the necessary information to assist the marketing department in figuring out different ways of retaining clients.
    • Computer Vision Age Detection: Build a computer vision regression model to predict the approximate age of a customer from a supermarket checkout photograph.
    • IMDB Movie Sentiment Analysis using NLP: Build a machine learning model to automatically detect negative reviews for a system used to filter and categorize movie reviews.
    • Time Series Forecast: Use historical data on taxi orders at airports to create a model that predicts the number of taxi orders for the next hour.
    • Vehicle Market Value Prediction: Determine the market value of a used car using historical car data. Identify the quality and speed of prediction for various models.
    • Insurance Benefits Predictive Model: Linear regression for insurance benefits prediction
    • Gold Recovery Regression Model: Prepare a prototype of a machine learning model for Zyfra, a company that develops efficiency solutions for heavy industry. The model predicts the amount of gold recovered from gold ore, with the goal of optimizing production and eliminating unprofitable parameters.
    • Telecom Plan Classification Model: Build a machine learning model to identify the right plan for each subscriber based on their behavior, using the historical data available.
    • Bank Customer Churn Prediction Model: Create a classification model to predict customer churn for a bank from an imbalanced dataset.
    • Oil Well Regression Model: Build a Linear Regression model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrapping technique.

Tools: sklearn, Pandas, Seaborn, Matplotlib, TensorFlow, PIL (Python Imaging Library)
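
The bootstrapping step mentioned for the Oil Well Regression Model can be illustrated with a short sketch. It is a generic bootstrap of mean profit under assumed inputs, not the project's actual procedure. The function name, the profit figures, and the choice of a 95% interval are all hypothetical.

```python
import random
import statistics


def bootstrap_profit(profits, n_resamples=1000, seed=0):
    """Bootstrap the mean of `profits`.

    Returns (mean of bootstrap means, (low, high) 95% interval, loss risk),
    where loss risk is the share of resamples whose mean profit is negative.
    """
    rng = random.Random(seed)
    n = len(profits)
    means = []
    for _ in range(n_resamples):
        # Draw n values with replacement from the observed profits.
        sample = [profits[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.mean(sample))
    # n=40 yields cut points every 2.5%; first and last bound the central 95%.
    cuts = statistics.quantiles(means, n=40)
    loss_risk = sum(m < 0 for m in means) / n_resamples
    return statistics.mean(means), (cuts[0], cuts[-1]), loss_risk


# Hypothetical per-well profits (in millions), for illustration only.
profits = [1.2, -0.4, 0.8, 2.1, -1.0, 0.5, 1.7, 0.3]
mean_profit, (low, high), risk = bootstrap_profit(profits)
print(f"mean: {mean_profit:.2f}, 95% interval: ({low:.2f}, {high:.2f}), loss risk: {risk:.1%}")
```

Comparing the interval and the loss risk across regions gives a way to weigh expected profit against the chance of losing money, which is the kind of comparison the project description refers to.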

  • Statistical Data Analysis and Visualisation (Python)

    • Video Game Market Analysis: Data Analysis identifying what factors make a video game succeed. Identify patterns in historical game sales data, analyze metrics for each video game platform, and conduct statistical hypothesis testing to find potential big winners and plan advertising campaigns.
    • Telecom Plan Analysis: Preliminary analysis of the plans based on a relatively small client selection. Analyze clients' behavior and determine which prepaid plan brings in more revenue. Conduct statistical hypothesis testing on profit from different plan users and different regions.
    • Vehicle Sales Analysis: Analysis of the factors that affect the price of a vehicle listed on a car sales website.

Tools: Pandas, Seaborn and Matplotlib, SciPy
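
The hypothesis tests in these projects compare groups such as users of two different plans. The sketch below uses a permutation test as a dependency-free stand-in. The projects themselves list SciPy, so they may well have used a t-test instead. The function name and the revenue figures are assumptions for illustration.

```python
import random
import statistics


def permutation_test(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(sample_a) - statistics.mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical monthly revenue per user for two plans, for illustration only.
plan_a = [20, 22, 19, 21, 23, 20, 22, 21]
plan_b = [50, 52, 49, 51, 53, 50, 52, 51]
p_value = permutation_test(plan_a, plan_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the difference in mean revenue is unlikely under the null hypothesis that plan labels are interchangeable.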

  • Data Collection and Storage (Python and PostgreSQL)

    • Ride Sharing App Analysis: Data analysis on Chicago taxicab rides and weather reports to advise the hypothetical ride-sharing company Zuber. Study a database, analyze data from competitors, and test a hypothesis about the impact of weather on ride frequency.

Tools: Beautiful Soup, Requests, Pandas, PostgreSQL, scipy.stats, NumPy

  • Data Preprocessing (Python)

    • Credit Score Analysis: Analyze borrowers’ risk of defaulting. Prepare a report for a bank’s loan division on the likelihood that a customer defaults on a loan. Find out whether a customer’s marital status and number of children have an impact on whether they will default.

Tools: Python, NLTK, WordNetLemmatizer, SnowballStemmer
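
The question of whether marital status or number of children relates to default can be approached by comparing default rates per group. The sketch below is a minimal illustration. The field names (`family_status`, `children`, `defaulted`) and the records are hypothetical and not taken from the project's dataset.

```python
from collections import defaultdict


def default_rate_by(records, key):
    """Return the share of defaulted loans for each distinct value of `key`."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        defaults[row[key]] += row["defaulted"]  # 1 if the loan defaulted, else 0
    return {group: defaults[group] / totals[group] for group in totals}


# Hypothetical borrower records, for illustration only.
records = [
    {"family_status": "married", "children": 0, "defaulted": 0},
    {"family_status": "married", "children": 2, "defaulted": 1},
    {"family_status": "unmarried", "children": 0, "defaulted": 1},
    {"family_status": "unmarried", "children": 1, "defaulted": 1},
    {"family_status": "married", "children": 1, "defaulted": 0},
    {"family_status": "married", "children": 0, "defaulted": 0},
]
print(default_rate_by(records, "family_status"))
print(default_rate_by(records, "children"))
```

With real data, group sizes matter: a rate computed from a handful of borrowers says little, so counts are worth reporting next to each rate.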

Undergraduate Research Projects (University at Buffalo)

Tools: SPSS Statistics

Psychological Assessment Scales and Measures:

  • Multigroup Ethnic Identity Measure (MEIM)
  • State-Trait Anxiety Inventory (STAI)
