- Dealing with large datasets and diverse data sources can be challenging when applying traditional machine learning techniques.
- Spark, a distributed processing engine that generalizes the MapReduce model with in-memory computation, addresses these challenges in big data processing.
- This project focuses on classification and clustering in Spark MLlib using airlines data.
- The implementation includes a decision tree classifier, a random forest classifier, and K-Means clustering.
- Data location: s3://airlines123/airline/data.zip
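The three algorithms named above can be sketched with the `pyspark.ml` API as follows. This is a minimal illustration, not the project's actual notebooks: the column names `dep_delay`, `distance`, and `is_late`, the hyperparameters, and the rows in `demo()` are all made up for the example, since the airlines schema is not listed here.

```python
from typing import Dict, List

# Hypothetical column names: the real airlines schema is not given in this document.
FEATURE_COLS: List[str] = ["dep_delay", "distance"]
LABEL_COL = "is_late"


def fit_models(df, feature_cols=FEATURE_COLS, label_col=LABEL_COL, k=2, seed=42) -> Dict[str, object]:
    """Fit a decision tree, a random forest, and K-Means on a Spark DataFrame."""
    # Imported inside the function so this file can be loaded without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # MLlib estimators expect all features packed into a single vector column.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    estimators = {
        "decision_tree": DecisionTreeClassifier(
            labelCol=label_col, featuresCol="features", maxDepth=5, seed=seed
        ),
        "random_forest": RandomForestClassifier(
            labelCol=label_col, featuresCol="features", numTrees=20, seed=seed
        ),
        # K-Means is unsupervised, so it uses only the feature vector.
        "kmeans": KMeans(featuresCol="features", k=k, seed=seed),
    }
    # One fitted pipeline (assembler + estimator) per algorithm.
    return {
        name: Pipeline(stages=[assembler, est]).fit(df)
        for name, est in estimators.items()
    }


def demo():
    """Train on a tiny in-memory DataFrame (made-up rows, local Spark session)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("mllib-sketch").getOrCreate()
    rows = [(5.0, 300.0, 0.0), (45.0, 1200.0, 1.0), (2.0, 250.0, 0.0), (60.0, 900.0, 1.0)]
    df = spark.createDataFrame(rows, FEATURE_COLS + [LABEL_COL])
    models = fit_models(df)
    models["decision_tree"].transform(df).select(LABEL_COL, "prediction").show()
    spark.stop()
```

Calling `demo()` requires a working Spark and Java installation. For real data, the DataFrame would instead be read from the extracted contents of data.zip, and a train/test split would normally precede fitting.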
- Language: Python
- Package: PySpark
- Services: Spark
- File Names: DecisionTree.ipynb, RandomForest.ipynb, K_means.ipynb
- Datasets: data.zip, Social_Network_Ads.csv
- Execute using Python script:
<spark_path> spark-submit <file_path>
<spark_path>: Path to Spark installation
<file_path>: Path to the script file
Example:
C:\Users\admin\Desktop\spark\bin>spark-submit C:\Users\admin\Desktop\sparkml\DecisionTree.py
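The same `spark-submit` invocation can also be launched from Python, which is convenient when several scripts need to run in sequence. The sketch below is one possible wrapper, not part of the project: the helper names `spark_submit_command` and `run_script` are hypothetical, and `spark_bin` is assumed to be the `bin` directory of the Spark installation.

```python
import subprocess
from pathlib import Path
from typing import List


def spark_submit_command(spark_bin: str, script: str) -> List[str]:
    """Build the argument list for `<spark_path> spark-submit <file_path>`."""
    # Assumes spark_bin is the Spark "bin" directory that contains the launcher.
    launcher = Path(spark_bin) / "spark-submit"
    return [str(launcher), script]


def run_script(spark_bin: str, script: str) -> None:
    """Run the script through spark-submit and raise if it exits with an error."""
    subprocess.run(spark_submit_command(spark_bin, script), check=True)
```

On Windows the launcher is typically `spark-submit.cmd`, so the file name may need adjusting there.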
- Modular Code
- Create a virtual environment
- Install requirements:
pip install -r requirements.txt
- Run code:
python DecisionTree.py
- Check output for all visualizations
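The modular-code steps above can be automated with Python's standard `venv` module. This is a sketch under assumptions: the environment directory name `.venv` and the helper names are hypothetical, while `requirements.txt` and `DecisionTree.py` are the files named in the steps.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir: str) -> Path:
    """Return the path of the Python interpreter inside a virtual environment."""
    # Windows places executables under Scripts\, other platforms under bin/.
    if sys.platform == "win32":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


def setup_and_run(env_dir: str = ".venv",
                  requirements: str = "requirements.txt",
                  script: str = "DecisionTree.py") -> None:
    """Create the environment, install requirements, then run the script."""
    venv.EnvBuilder(with_pip=True).create(env_dir)
    python = str(venv_python(env_dir))
    subprocess.run([python, "-m", "pip", "install", "-r", requirements], check=True)
    subprocess.run([python, script], check=True)
```

Calling `setup_and_run()` from the project directory performs all three steps. Invoking the environment's own interpreter avoids having to activate the environment first.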