This repository showcases my work in data analysis and data science. Each project involves working with messy datasets, applying SQL and Python for data cleaning, and building predictive models to generate insights and solve problems.
This is an ever-evolving repository: some of the tasks presented may not be fully complete yet, and work in progress will continue to be added over time.
I believe the best way to improve is through trial and error, so you may encounter mistakes or less-than-perfect solutions in the code. Rather than hiding them, I have intentionally left them in place to hold myself accountable.
Throughout the projects, comments highlight the different approaches taken, what was done, and what was assumed.
- Handling missing values (imputation, removal, interpolation)
- Correcting inconsistent string values & data types
- Feature engineering (creating new columns, transforming variables)
- Handling outliers & scaling numerical features
- Creating and working with datetime features for time series analysis (a minimal cleaning sketch follows this list)
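The snippet below is a minimal, hedged sketch of the cleaning steps listed above, using pandas on a hypothetical `sales.csv` file; the file name and column names (`price`, `category`, `order_date`) are illustrative assumptions, not columns from any specific project dataset.

```python
import pandas as pd

# Illustrative example only: file name and column names are assumptions.
df = pd.read_csv("sales.csv")

# Handle missing values: impute the numeric column with the median, drop rows
# where the categorical column is missing.
df["price"] = df["price"].fillna(df["price"].median())
df = df.dropna(subset=["category"])

# Correct inconsistent string values and data types.
df["category"] = df["category"].str.strip().str.lower()
df["price"] = df["price"].astype(float)

# Feature engineering: create new columns from existing ones.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["order_month"] = df["order_date"].dt.to_period("M")

# Handle outliers with the IQR rule, then scale the numeric feature.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
df["price_scaled"] = (df["price"] - df["price"].mean()) / df["price"].std()
```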
- Writing queries (joins, window functions, aggregations, common table expressions, stored procedures, transactions, and string manipulations)
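To keep the examples in one language, here is a hedged sketch of a query that combines a common table expression with a window function, run through Python's built-in sqlite3 module; the table and column names are made up for illustration, and the projects themselves use MySQL.

```python
import sqlite3
import pandas as pd

# In-memory database with an illustrative table; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, category TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'books', 20.0), (2, 'books', 35.0),
        (3, 'games', 50.0), (4, 'games', 15.0);
""")

# CTE plus a window function: rank orders by amount within each category,
# then keep the top order per category.
query = """
WITH ranked AS (
    SELECT
        order_id,
        category,
        amount,
        RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS amount_rank
    FROM orders
)
SELECT * FROM ranked WHERE amount_rank = 1;
"""
print(pd.read_sql_query(query, conn))
```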
- Visualizing distributions, correlations, and trends
- Generating insights through graphs and statistical summaries
- Detecting patterns & anomalies in data (a short EDA sketch follows this list)
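As an illustration of the EDA steps above, here is a small, hedged sketch using pandas and matplotlib; the DataFrame and its columns are placeholders rather than data from a specific project.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data; in the projects this comes from the cleaned CSV files.
df = pd.DataFrame({
    "price": [12.0, 15.5, 14.2, 80.0, 13.8, 16.1],
    "quantity": [3, 2, 4, 1, 5, 2],
})

# Statistical summaries and correlations.
print(df.describe())
print(df.corr())

# Distribution of a numeric feature.
df["price"].plot(kind="hist", bins=5, title="Price distribution")
plt.xlabel("price")
plt.show()

# Simple anomaly check: flag values more than 2 standard deviations from the mean.
z_scores = (df["price"] - df["price"].mean()) / df["price"].std()
print(df[z_scores.abs() > 2])
```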
- Supervised Learning: Linear Regression, Logistic Regression, Random Forest, Decision Tree, XGBoost, Gradient Boosting, KNN, and Neural Networks (a minimal fitting sketch follows this list)
- Unsupervised Learning: K-Means Clustering and PCA for dimensionality reduction
- Time Series Forecasting: ARIMA, SARIMA, and Exponential Smoothing
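The sketch below shows the general pattern used for the supervised models, here with a random forest on scikit-learn's built-in Iris dataset as a stand-in for the project data; the same fit/predict flow applies to the other models listed above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Built-in dataset as a stand-in for the project CSVs.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit one of the listed supervised models and evaluate it on held-out data.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```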
- Objective: Cleaning and visualizing the data, then using statistics and various data science models to draw conclusions
- Dataset: (https://github.com/KeithGalli/pandas) (csv file is also available in the Project 1 folder)
- Key Tasks:
- Performed data cleaning & feature engineering
- Conducted exploratory data analysis (EDA)
- Used machine learning models (I applied all the models mentioned above and evaluated which ones work and which don't for extracting useful information)
- Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
- Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), and Transferring the data to MySQL (see the transfer sketch below)
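As a hedged illustration of the "Transferring the data to MySQL" step, this sketch writes a cleaned DataFrame to a MySQL table with SQLAlchemy, assuming the PyMySQL driver is installed; the connection string, credentials, file name, and table name are placeholders, and the projects' actual loading code may differ.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials and database name; adjust for a real MySQL instance.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/portfolio_db")

# Hypothetical cleaned output from the analysis step.
df = pd.read_csv("cleaned_sales.csv")

# Write the DataFrame to a MySQL table, replacing it if it already exists.
df.to_sql("sales", con=engine, if_exists="replace", index=False)

# Read it back to confirm the transfer worked.
print(pd.read_sql("SELECT COUNT(*) AS n_rows FROM sales", con=engine))
```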
- Objective: Cleaning and visualizing the data, then using statistics and various data science models to draw conclusions. This project was purposefully built on a small number of rows to emphasize that machine learning models need a sufficiently large sample to work properly, as their results on tiny datasets can be deceiving (see the sketch after this project's details).
- Dataset: ChatGPT-generated data (available in the Project 2 folder as a CSV file)
- Key Tasks:
- Performed data cleaning & feature engineering
- Conducted exploratory data analysis (EDA)
- Used machine learning models (I applied all the models mentioned above except the time series ones and evaluated which ones work and which don't for extracting useful information)
- Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
- Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), and Transferring the data to MySQL
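To illustrate the point about small samples being deceiving, here is a hedged sketch on synthetic data: with only a handful of rows, a flexible model can score perfectly on the rows it has seen while telling you almost nothing about new data. The data and split sizes below are made up for demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Ten rows of pure noise: the labels have no real relationship to the features.
X = rng.normal(size=(10, 4))
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

model = DecisionTreeClassifier(random_state=0)

# Training accuracy looks perfect on a tiny sample...
model.fit(X, y)
print("Training accuracy:", model.score(X, y))

# ...but cross-validated accuracy hovers around chance, exposing the illusion.
print("Cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())
```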
- Objective: Cleaning and visualizing the data, then using statistics and various data science models to draw conclusions
- Dataset: (csv files available in the Project 3 folder)
- Key Tasks:
- Performed data cleaning & feature engineering
- Conducted exploratory data analysis (EDA)
- Built and fine-tuned machine learning models (I applied all the models mentioned above except the time series ones and evaluated which ones work and which don't for extracting useful information)
- Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
- Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), and Transferring the data to MySQL
- Objective: Cleaning and visualizing the data, then using statistics and various data science models to draw conclusions
- Dataset: (https://www.kaggle.com/datasets/safaeahb/car-sales-analysis-dashboard/data?select=car+sales.csv) (csv file is also available in the Project 4 folder)
- Key Tasks:
- Performed data cleaning & feature engineering
- Conducted exploratory data analysis (EDA)
- Built and fine-tuned machine learning models (I applied all the models mentioned above except the time series ones and evaluated which ones work and which don't for extracting useful information)
- Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
- Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), and Transferring the data to MySQL
- Objective: Cleaning and visualizing the data, then using statistics and various data science models to draw conclusions
- Dataset: (https://github.com/KeithGalli/Regression-Example) (csv file is also available in the Project 5 folder)
- Key Tasks:
- Performed data cleaning & feature engineering
- Conducted exploratory data analysis (EDA)
- Built and fine-tuned machine learning models (I applied all the models mentioned above except the time series ones and evaluated which ones work and which don't for extracting useful information)
- Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
- Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), and Transferring the data to MySQL
- Objective: Cleaning and visualizing the data, then using statistics and various data science models to draw conclusions
- Dataset: (https://archive.ics.uci.edu/dataset/502/online+retail+ii)
- Key Tasks:
- Performed data cleaning & feature engineering
- Conducted exploratory data analysis (EDA)
- Built and fine-tuned machine learning models
- Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
- Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), and Transferring the data to MySQL
- Objective: Cleaning and visualizing the data, then using statistics and various data science models to draw conclusions
- Dataset: (https://github.com/harshbg/Telecom-Churn-Data-Analysis/blob/master/Telecom%20Churn.csv) (csv file is also available in the Project 7 folder)
- Key Tasks:
- Performed data cleaning & feature engineering
- Conducted exploratory data analysis (EDA)
- Used machine learning models (I applied all the models mentioned above and evaluated which ones work and which don't for extracting useful information)
- Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
- Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), and Transferring the data to MySQL
This folder contains implementations of commonly used machine learning models (which correspond to the headers), including the following; a minimal forecasting sketch appears after the list:
- Linear Regression
- Logistic Regression
- Random Forest, Decision Tree
- Gradient Boosting
- Neural Networks (Basic MLP)
- K-Means
- Principal Component Analysis (PCA)
- Time Series Forecasting (ARIMA, SARIMA, Exponential Smoothing)
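As a hedged example of the time series forecasting part, here is a minimal sketch that fits a Holt-Winters exponential smoothing model with statsmodels on a synthetic monthly series; the data, seasonal period, and forecast horizon are illustrative assumptions rather than the repository's actual configuration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with trend and yearly seasonality (illustrative only).
rng = np.random.default_rng(42)
index = pd.date_range("2018-01-01", periods=60, freq="MS")
values = (
    100
    + 0.5 * np.arange(60)                          # upward trend
    + 10 * np.sin(2 * np.pi * np.arange(60) / 12)  # yearly seasonality
    + rng.normal(0, 2, 60)                         # noise
)
series = pd.Series(values, index=index)

# Fit exponential smoothing with additive trend and seasonality.
model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()

# Forecast the next 12 months.
print(fit.forecast(12))
```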