Skip to content

anantppatnaik/Python-Data-Analysis-projects

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

✅ README 1 — Online News Popularity (Shares Prediction)

Domain: Social Media Goal: Predict whether an article will be popular (shares ≥ 1400) or unpopular.

Project Overview

Mashable collected metadata for ~40,000 online articles. The task is to build a classification model that predicts if an article will go viral based on its content features, sentiment, metadata, keywords, and publishing time.

Dataset

File: OnlineNewsPopularity.csv Rows: ~39,644 Features: 60+ Includes:

Content statistics (title length, word count, token stats)

Sentiment polarity metrics

Channel/category indicators

Day-of-week publishing flags

Keyword & link metadata

Target: shares

Articles were labeled as:

1 → shares ≥ 1400 (popular)

0 → shares < 1400 (unpopular)

Key Steps

  1. Data Exploration

Checked dataset structure using .info(), .describe(), .head()

Built a correlation heatmap

Visualized publishing day vs popularity

Visualized article categories vs popularity distribution

  1. Feature Engineering

Log-transform applied to skewed numerical columns

Kept sentiment-related features unchanged

Dropped unsupported columns (URL, timedelta)

  1. Train–Test Split train_test_split(data[features], data['shares'], test_size=0.25, random_state=0)

  2. Models Trained

Random Forest Classifier

Bernoulli Naive Bayes

k-Nearest Neighbors

Support Vector Machine (RBF kernel)

Results Model Accuracy Random Forest 0.6649 KNN 0.6070 Naive Bayes 0.5924 SVM (RBF, probability=True) ~0.63 (varies by run)

Conclusion: Random Forest offers the best performance without heavy tuning.

Tools & Libraries

Python

pandas, numpy

seaborn, matplotlib

scikit-learn

Next Improvements

Hyperparameter tuning (GridSearchCV)

XGBoost for higher accuracy

Balanced dataset handling

Advanced NLP-based embeddings (TF-IDF on title/content)


✅ README 2 — Avazu / CTR Prediction Project (Your First Code Block)

Domain: Digital Advertising / Machine Learning Goal: Predict the click-through rate (CTR) of ads using historical ad features.

Project Overview

The objective is to predict whether an ad will be clicked based on categorical and numerical attributes available in the ad logs.

Dataset

Input: Training CSV with features describing ad interactions

Target: click (0 = no click, 1 = clicked)

Dataset characteristics:

Large volume

Mostly categorical

Typical for CTR prediction challenges

Key Steps

  1. Data Preparation

Loaded training and test data using pandas

Encoded target labels using LabelEncoder

Selected numerical and categorical columns for modeling

Converted categories with appropriate encoding

  1. Model Training

Evaluated four ML algorithms:

RandomForestClassifier

DecisionTreeClassifier

NaiveBayes

Support Vector Machine (SVC)

  1. Evaluation

Metrics used:

Accuracy

Log Loss

Workflow:

model.predict(X_test) model.predict_proba(X_test) log_loss(y_test, pred)

  1. Best Model

Random Forest achieved the highest accuracy and lower log-loss compared to other models.

Tools Used

pandas, numpy

scikit-learn (RF, SVM, NB, DT)

matplotlib

Future Enhancements

One-hot encoding for high-cardinality categories

Feature hashing

XGBoost or LightGBM

Hyperparameter tuning

Calibration to improve predicted probabilities


✅ README 3 — Salary Prediction Using Supervised Learning

Domain: Human Resources / ML Regression Goal: Predict employee salary based on skills, experience, industry, and demographic features.

Project Overview

This project builds a regression model to estimate a professional’s salary using structured input features. The work includes data cleaning, visualization, model building, and evaluation.

Dataset

Contains:

Education details

Job title / seniority

Experience

Location

Industry

Salary (target variable)

Key Steps

  1. Data Cleaning

Removed missing or inconsistent entries

Encoded categorical features

Normalized numerical variables

  1. Exploratory Data Analysis (EDA)

Correlation heatmap

Boxplots and distribution visualizations

Outlier detection

Trends by role, experience and education

  1. Modeling

Trained multiple regression models:

Linear Regression

Decision Tree Regressor

Random Forest Regressor

Support Vector Regression (SVR)

Evaluated using:

MAE

RMSE

R² Score

  1. Best Model

Random Forest Regressor produced the most stable performance with lowest error.

Libraries Used

pandas

numpy

seaborn, matplotlib

scikit-learn

Possible Improvements

Hyperparameter tuning

Cross-validation

Feature selection based on SHAP

Using CatBoost or LightGBM

If you'd like, I can also:

✅ Combine all 3 into a single portfolio README ✅ Add GitHub folder structure ✅ Generate badges (accuracy, dataset size, language) ✅ Create a polished "Project Summary" for LinkedIn

Just tell me!

About

Sales project of a FMCG

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors