Implemented Classification and Clustering models using Spark MLlib on Airlines Data.

AjNavneet/Airlines-Classification-Clustering-PySparkML

Airlines Classification using PySpark ML

Spark ML

  • Dealing with large datasets and diverse data sources can be challenging when applying traditional machine learning techniques.
  • Spark, a distributed processing engine that generalizes the MapReduce model with in-memory computation, addresses these challenges in big data processing.

Objective

This project focuses on Classification and Clustering in Spark MLlib using Airlines Data.

  • Implementation includes Decision tree classifier, Random forest classifier, and K-Means clustering algorithms.
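As a rough illustration of how these three algorithms are typically set up with the `pyspark.ml` API, the sketch below builds unfitted pipelines. It is not the code from the notebooks: the helper names (`build_classifier_pipeline`, `build_kmeans_pipeline`), the `features` column name, and the hyperparameters (`numTrees=20`, `seed=42`, `k=3`) are assumptions chosen for illustration, and the feature and label columns depend on the actual schema of the airlines data.

```python
from typing import List

# Hypothetical helper names; the notebooks may structure this differently.
SUPPORTED_CLASSIFIERS = ("decision_tree", "random_forest")


def build_classifier_pipeline(feature_cols: List[str], label_col: str,
                              model: str = "decision_tree"):
    """Return an unfitted Pipeline: VectorAssembler -> tree-based classifier."""
    if model not in SUPPORTED_CLASSIFIERS:
        raise ValueError(
            f"model must be one of {SUPPORTED_CLASSIFIERS}, got {model!r}")

    # Imported here so the module loads even where Spark is not installed.
    # Creating these stages requires an active SparkSession.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import (DecisionTreeClassifier,
                                           RandomForestClassifier)
    from pyspark.ml.feature import VectorAssembler

    # Spark ML estimators expect all features packed into one vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    if model == "decision_tree":
        classifier = DecisionTreeClassifier(featuresCol="features",
                                            labelCol=label_col)
    else:
        # numTrees and seed are illustrative values, not tuned settings.
        classifier = RandomForestClassifier(featuresCol="features",
                                            labelCol=label_col,
                                            numTrees=20, seed=42)
    return Pipeline(stages=[assembler, classifier])


def build_kmeans_pipeline(feature_cols: List[str], k: int = 3):
    """Return an unfitted Pipeline: VectorAssembler -> K-Means clustering."""
    if k < 2:
        raise ValueError("k must be at least 2")

    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    kmeans = KMeans(featuresCol="features", k=k, seed=42)
    return Pipeline(stages=[assembler, kmeans])
```

With an active `SparkSession` and a DataFrame whose label column is numeric, calling `.fit(train_df)` on either pipeline returns a fitted `PipelineModel`, and `.transform(test_df)` on that model adds a `prediction` column.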

Business Overview of Airlines Industry


S3 Link for Dataset

  • s3://airlines123/airline/data.zip

Tech Stack

  • Language: Python
  • Package: PySpark
  • Services: Spark

Code Overview

  • File Names:

    • DecisionTree.ipynb
    • RandomForest.ipynb
    • K_means.ipynb
  • Datasets:

    • data.zip
    • Social_Network_Ads.csv
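Since the airlines data ships as a zip archive, it has to be unpacked before Spark can read it. A minimal sketch using only the Python standard library follows; the function name `extract_dataset` and the destination directory are hypothetical, and nothing is assumed about the files inside `data.zip`.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_dataset(zip_path: str, dest_dir: str) -> List[str]:
    """Extract every member of the archive into dest_dir.

    Returns the member names as stored in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, `extract_dataset("data.zip", "data")` would unpack the archive into a local `data` folder and return the list of extracted member names.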

Steps to Run

Command Prompt

  1. Run the Python script with spark-submit from the Spark bin directory:

    <spark_path> spark-submit <file_path>
    
    • <spark_path>: Path to the bin directory of the Spark installation (the working directory of the prompt)
    • <file_path>: Path to the script file

    Example:

    C:\Users\admin\Desktop\spark\bin> spark-submit C:\Users\admin\Desktop\sparkml\DecisionTree.py
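The same invocation can also be assembled from Python, which may help when scripting runs for several files. This is a sketch under assumptions: the helper names are hypothetical, `--master local[*]` is an illustrative choice for a local run, and the Windows launcher is assumed to be the `spark-submit.cmd` file found in Spark's bin directory.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def build_spark_submit_command(spark_bin_dir: str, script_path: str,
                               master: str = "local[*]") -> List[str]:
    """Build the argument list for running a script through spark-submit."""
    # On Windows the launcher in Spark's bin directory is spark-submit.cmd.
    launcher = "spark-submit.cmd" if os.name == "nt" else "spark-submit"
    return [str(Path(spark_bin_dir) / launcher), "--master", master, script_path]


def run_script(spark_bin_dir: str, script_path: str) -> int:
    """Run the script and return the exit code of the spark-submit process."""
    completed = subprocess.run(
        build_spark_submit_command(spark_bin_dir, script_path), check=False)
    return completed.returncode
```

Passing the arguments as a list avoids quoting problems when the Spark or script path contains spaces.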
    

IPython

  1. Modular Code
    • Create a virtual environment
    • Install requirements: pip install -r requirements.txt
    • Run code: python DecisionTree.py
    • Check output for all visualizations
