In this repository, I present a collection of projects focused on data analysis and science, featuring real-world datasets and one fictitious dataset for the sake of practice. The projects showcase various data analysis and data science techniques and serve as practical examples, using Excel, Tableau, Power BI, Python, and SQL.

AVC-prog/Data_Science_and_Analysis_with_Python_and_SQL

This repository showcases my work in data analysis and data science. Each project involves working with messy datasets, applying SQL and Python for data cleaning, and building predictive models to generate insights and solve problems.

This is an evolving repository: some of the tasks presented may not be complete yet, and work in progress will continue to be added over time.

I believe the best way to improve is through trial and error, so you may encounter mistakes or less-than-perfect solutions in the code. Rather than hiding them, I’ve intentionally left them in place to hold myself accountable.

The code also includes comments explaining the different approaches tried and the assumptions made.

Skills & Techniques Used:

Data Preprocessing & Cleaning

  • Handling missing values (imputation, removal, interpolation)
  • Correcting inconsistent string values & data types
  • Feature engineering (creating new columns, transforming variables)
  • Handling outliers & scaling numerical features
  • Creating and working with datetime features for time series analysis
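
As a rough illustration of the cleaning steps listed above, here is a minimal pandas sketch. The toy data and the column names (`name`, `amount`, `date`) are hypothetical and not taken from any project in this repository; median imputation and the 1.5 × IQR rule are just one reasonable choice among several.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "name": [" Alice", "BOB ", "carol", "Dave"],
    "amount": [10.0, None, 12.0, 500.0],
    "date": ["2024-01-05", "2024-01-06", "2024-02-10", "2024-02-11"],
})

# Correct inconsistent strings: trim whitespace and normalize case.
df["name"] = df["name"].str.strip().str.title()

# Impute the missing value with the column median.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

# Parse the date strings and derive a datetime feature.
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

print(df)
```

Whether to drop, cap, or keep a flagged outlier depends on the dataset, so the sketch only marks it.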

SQL & Database Management

  • Writing queries (joins, window functions, aggregations, common table expressions, stored procedures, transactions, and string manipulations)
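
The query types above can be sketched with a common table expression feeding a window function. The projects use MySQL; this example swaps in Python's built-in `sqlite3` module so it runs without a database server (the same query syntax is supported in MySQL 8.0 and later). The `sales` table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'A', 100), ('North', 'B', 250),
        ('South', 'A', 300), ('South', 'B', 50);
""")

# The CTE aggregates per region and product; the window function
# then ranks products within each region by their total.
query = """
WITH totals AS (
    SELECT region, product, SUM(amount) AS total
    FROM sales
    GROUP BY region, product
)
SELECT region, product, total,
       RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rnk
FROM totals
ORDER BY region, rnk;
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```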

Exploratory Data Analysis (EDA)

  • Visualizing distributions, correlations, and trends
  • Generating insights through graphs and statistical summaries
  • Detecting patterns & anomalies in data
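
A typical first EDA pass, sketched with pandas on hypothetical data (the columns `hours`, `score`, and `noise` are made up for illustration): a statistical summary followed by a correlation matrix.

```python
import pandas as pd

# Hypothetical toy data; not taken from any project dataset.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 70],
    "noise": [3, 1, 4, 1, 5],
})

# Statistical summary: count, mean, std, min, quartiles, max per column.
summary = df.describe()

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

print(summary)
print(corr.round(2))
```

In this toy data `hours` and `score` are strongly correlated, which is the kind of relationship a heatmap or scatter plot would then make visible.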

Machine Learning Models

  • Supervised Learning: Linear Regression, Logistic Regression, Random Forest, Decision Tree, XGBoost, Gradient Boosting, KNN, and Neural Networks
  • Unsupervised Learning: K-Means Clustering and PCA for dimensionality reduction
  • Time Series Forecasting: ARIMA, SARIMA, and Exponential Smoothing
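
To show the idea behind one of the forecasting methods listed above, here is a hand-written sketch of simple exponential smoothing using only the standard library. It is an illustration of the recurrence, not the code used in the projects, which may rely on a library implementation instead. The function name and the sample series are hypothetical.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s, where s[0] = x[0] and
    s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical series; the one-step-ahead forecast is the last smoothed value.
observations = [10, 12, 14, 16]
smoothed = simple_exponential_smoothing(observations, alpha=0.5)
forecast = smoothed[-1]

print(smoothed)
print(forecast)
```

A larger `alpha` weights recent observations more heavily; a smaller one produces a smoother, slower-reacting series.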

Project Overviews

Project 1: Pokémon

  • Objective: Clean and visualize the data, then use statistics and various data science models to draw conclusions
  • Dataset: https://github.com/KeithGalli/pandas (CSV file is also available in the Project 1 folder)
  • Key Tasks:
    • Performed data cleaning & feature engineering
    • Conducted exploratory data analysis (EDA)
    • Used machine learning models (I applied all the models listed above and assessed which ones yield useful information and which don't)
    • Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
  • Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

Project 2: Finance

  • Objective: Clean and visualize the data, then use statistics and various data science models to draw conclusions. The dataset was purposefully built with a small number of rows to emphasize that machine learning models need a large sample of data to work properly, since results on small samples can be deceiving.
  • Dataset: ChatGPT-generated data (available in the Project 2 folder as a CSV file)
  • Key Tasks:
    • Performed data cleaning & feature engineering
    • Conducted exploratory data analysis (EDA)
    • Used machine learning models (I applied all the models listed above, except the time series ones, and assessed which ones yield useful information and which don't)
    • Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
  • Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL
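
The point about small samples being deceiving can be illustrated with a short simulation. This is a sketch under stated assumptions, not code from the project: it imagines a model that is correct with a fixed probability of 0.8 and compares how much the measured accuracy fluctuates on small versus large test sets. The function name and all numbers are hypothetical.

```python
import random
import statistics

def measured_accuracy(n_samples, true_accuracy, rng):
    """Accuracy observed on n_samples test rows for a model that is
    correct on each row with probability true_accuracy."""
    hits = sum(rng.random() < true_accuracy for _ in range(n_samples))
    return hits / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable

# Repeat the evaluation 500 times at each test-set size.
small = [measured_accuracy(20, 0.8, rng) for _ in range(500)]
large = [measured_accuracy(2000, 0.8, rng) for _ in range(500)]

spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"std of accuracy with 20 rows:   {spread_small:.3f}")
print(f"std of accuracy with 2000 rows: {spread_large:.3f}")
```

With only 20 rows, a single evaluation can land far from the true 0.8, so a model may look much better or worse than it really is.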

Project 3: Soccer Analysis

  • Objective: Clean and visualize the data, then use statistics and various data science models to draw conclusions
  • Dataset: CSV files available in the Project 3 folder
  • Key Tasks:
    • Performed data cleaning & feature engineering
    • Conducted exploratory data analysis (EDA)
    • Built and fine-tuned machine learning models (I applied all the models listed above, except the time series ones, and assessed which ones yield useful information and which don't)
    • Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
  • Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

Project 4: Car Sales Analysis

  • Objective: Clean and visualize the data, then use statistics and various data science models to draw conclusions
  • Dataset: https://www.kaggle.com/datasets/safaeahb/car-sales-analysis-dashboard/data?select=car+sales.csv (CSV file is also available in the Project 4 folder)
  • Key Tasks:
    • Performed data cleaning & feature engineering
    • Conducted exploratory data analysis (EDA)
    • Built and fine-tuned machine learning models (I applied all the models listed above, except the time series ones, and assessed which ones yield useful information and which don't)
    • Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
  • Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

Project 5: Healthcare Insurance Analysis

  • Objective: Clean and visualize the data, then use statistics and various data science models to draw conclusions
  • Dataset: https://github.com/KeithGalli/Regression-Example (CSV file is also available in the Project 5 folder)
  • Key Tasks:
    • Performed data cleaning & feature engineering
    • Conducted exploratory data analysis (EDA)
    • Built and fine-tuned machine learning models (I applied all the models listed above, except the time series ones, and assessed which ones yield useful information and which don't)
    • Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
  • Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

Project 6: Online Retail Analysis

  • Objective: Clean and visualize the data, then use statistics and various data science models to draw conclusions
  • Dataset: https://archive.ics.uci.edu/dataset/502/online+retail+ii
  • Key Tasks:
    • Performed data cleaning & feature engineering
    • Conducted exploratory data analysis (EDA)
    • Built and fine-tuned machine learning models
    • Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
  • Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

Project 7: Telecommunications Analysis

  • Objective: Clean and visualize the data, then use statistics and various data science models to draw conclusions
  • Dataset: https://github.com/harshbg/Telecom-Churn-Data-Analysis/blob/master/Telecom%20Churn.csv (CSV file is also available in the Project 7 folder)
  • Key Tasks:
    • Performed data cleaning & feature engineering
    • Conducted exploratory data analysis (EDA)
    • Used machine learning models (I applied all the models listed above and assessed which ones yield useful information and which don't)
    • Applied SQL for data processing and analysis as an alternative to Python Pandas and PySpark
  • Structure: It has 3 main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

Simple Models Folder

This folder contains implementations of commonly used machine learning models (which correspond to the headers), including:

  • Linear Regression
  • Logistic Regression
  • Random Forest, Decision Tree
  • Gradient Boosting
  • Neural Networks (Basic MLP)
  • K-Means
  • Principal Component Analysis (PCA)
  • Time Series Forecasting (ARIMA, SARIMA, Exponential Smoothing)
