# Experiment 3
#### a. Build a classifier (Decision Tree) and compare its performance with an ensemble technique such as Random Forest.

### Step 1: Import Libraries

In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


### Step 2: Load the Data

In [2]:

# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r'D:\data_science\SQL\diabetes.csv')

# Assume the target variable is the last column
X = data.iloc[:, :-1]  # Features (all columns except the last)
y = data.iloc[:, -1]   # Target (the last column)


### Step 3: Split the Data into Training and Testing Sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
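
As a rough illustration of what the split does, the sketch below shuffles row indices with a fixed seed and sets aside 20% of them for testing. It uses only the standard library and is not scikit-learn's implementation. The helper name `split_indices` is hypothetical, and the row count of 768 is an assumption: it is consistent with the 154 test rows shown in the reports, given that the test share is rounded up.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=43):
    """Shuffle row indices and split them into train and test lists.

    Hypothetical helper for illustration; it mimics the idea behind
    train_test_split but not its exact shuffling.
    """
    n_test = math.ceil(test_size * n_rows)   # test share is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # fixed seed -> repeatable split
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(768)     # 768 rows is an assumed dataset size
print(len(train_idx), len(test_idx))
```

Because the seed is fixed, calling the helper again with the same arguments returns the same two lists, which is the role `random_state` plays in the notebook.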


### Step 4: Train the Decision Tree and Random Forest Classifiers

In [5]:
# Create both models: a single decision tree and a forest of 100 trees.
# No random_state is set, so results can vary slightly between runs.
dt_model = DecisionTreeClassifier()
rf_model = RandomForestClassifier(n_estimators=100)

# Fit both models on the same training data so the comparison is fair
dt_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)


### Step 5: Evaluate the Decision Tree Model

In [6]:
# Make predictions
y_pred_dt = dt_model.predict(X_test)

# Calculate accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Print results
print("Decision Tree Classifier")
print("Accuracy:", accuracy_dt)
print("Classification Report:\n", classification_report(y_test, y_pred_dt))


Decision Tree Classifier
Accuracy: 0.6948051948051948
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.79      0.77       100
           1       0.57      0.52      0.54        54

    accuracy                           0.69       154
   macro avg       0.66      0.65      0.66       154
weighted avg       0.69      0.69      0.69       154
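
The figures in the report follow from four counts. The sketch below recomputes them from a confusion matrix inferred from the rounded values above (79 and 28 correct predictions for classes 0 and 1). The counts are a reconstruction for illustration, not output captured from the notebook.

```python
# Confusion-matrix counts inferred from the rounded report above (an assumption):
# outer keys are true classes, inner keys are predicted classes.
confusion = {
    0: {0: 79, 1: 21},   # 100 true class-0 samples
    1: {0: 26, 1: 28},   # 54 true class-1 samples
}

def precision(cls):
    """Correct predictions of cls divided by all predictions of cls."""
    predicted = sum(row[cls] for row in confusion.values())
    return confusion[cls][cls] / predicted

def recall(cls):
    """Correct predictions of cls divided by all true samples of cls."""
    return confusion[cls][cls] / sum(confusion[cls].values())

def f1(cls):
    """Harmonic mean of precision and recall."""
    p, r = precision(cls), recall(cls)
    return 2 * p * r / (p + r)

total = sum(sum(row.values()) for row in confusion.values())
accuracy = (confusion[0][0] + confusion[1][1]) / total

print(f"accuracy: {accuracy:.4f}")
for cls in (0, 1):
    print(cls, f"{precision(cls):.2f} {recall(cls):.2f} {f1(cls):.2f}")
```

The lower recall for class 1 means the tree misses roughly half of the positive cases, which the overall accuracy alone does not reveal.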



### Step 6: Evaluate the Random Forest Model

In [None]:
# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# Print results
print("Random Forest Classifier")
print("Accuracy:", accuracy_rf)
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
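
To complete the comparison the experiment calls for, the two accuracies can be placed side by side. The sketch below does this with the standard library only. The label lists are small hypothetical examples; in the notebook, `y_test`, `y_pred_dt`, and `y_pred_rf` would take their place.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def compare_models(y_true, predictions):
    """Return (model name, accuracy) pairs sorted from best to worst."""
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical labels for illustration only; these are not notebook results.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
predictions = {
    "Decision Tree": [0, 1, 0, 0, 1, 1, 0, 0],
    "Random Forest": [0, 1, 1, 0, 1, 1, 0, 1],
}

ranking = compare_models(y_true, predictions)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A single train/test split gives a noisy comparison, so repeating it over several values of `random_state` is a more reliable basis for a conclusion.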

#### b. Write a program using a Support Vector Machine (SVM).
An SVM is a classifier that handles linearly separable data directly and non-linear data through kernel functions. Its performance can be evaluated with metrics such as accuracy, precision, recall, and F1-score.

### Step 1: Import Libraries

In [10]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report


### Step 2: Load the Data

In [11]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r'D:\data_science\SQL\diabetes.csv')

# Assume the target variable is the last column
X = data.iloc[:, :-1]  # Features (all columns except the last)
y = data.iloc[:, -1]   # Target (the last column)


### Step 3: Split the Data into Training and Testing Sets

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Step 4: Train an SVM Classifier

In [13]:
# Create the SVM model
svm_model = SVC(kernel='linear', random_state=42)

# Train the model
svm_model.fit(X_train, y_train)
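
Once trained, a linear-kernel SVM classifies a sample by the sign of w·x + b. The sketch below shows that rule in plain Python with made-up numbers. The weights, bias, and two-feature samples are hypothetical; in scikit-learn the fitted values are available as `svm_model.coef_` and `svm_model.intercept_` when the kernel is linear.

```python
def decision_value(weights, bias, sample):
    """Signed score w . x + b; its sign tells which side of the boundary the sample is on."""
    return sum(w * x for w, x in zip(weights, sample)) + bias

def predict(weights, bias, sample):
    """Class 1 if the sample lies on the positive side of the hyperplane, else class 0."""
    return 1 if decision_value(weights, bias, sample) > 0 else 0

# Hypothetical weights for two features; these are not fitted values.
weights = [0.04, 0.09]
bias = -7.5

for sample in ([150.0, 35.0], [90.0, 22.0]):
    print(sample, predict(weights, bias, sample))
```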


### Step 5: Evaluate the SVM Model

In [14]:
# Make predictions
y_pred_svm = svm_model.predict(X_test)

# Calculate accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)

# Print results
print("SVM Classifier")
print("Accuracy:", accuracy_svm)
print("Classification Report:\n", classification_report(y_test, y_pred_svm))


SVM Classifier
Accuracy: 0.7532467532467533
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.81      0.81        99
           1       0.65      0.65      0.65        55

    accuracy                           0.75       154
   macro avg       0.73      0.73      0.73       154
weighted avg       0.75      0.75      0.75       154



## AIM: Analysis of Weather Dataset on Multi-Node Cluster
### Description
#### Objective:
The goal is to analyze a weather dataset to calculate metrics like the maximum temperature, minimum temperature, or average temperature for different locations or time periods. This analysis will be distributed across multiple nodes in a Hadoop cluster using the MapReduce framework.

#### Multi-Node Hadoop Cluster:
The analysis will run on a Hadoop cluster with multiple nodes, where the workload is distributed across different machines to process large datasets efficiently.

### Steps
#### 1 Prepare the Weather Dataset:
The dataset should be in a structured format, such as CSV, with columns like Date, Location, Temperature, Precipitation, etc.

#### 2 WeatherMapper:
Extracts the date, location, and temperature from each record in the dataset and emits a key-value pair where the key is the combination of date and location, and the value is the temperature.

#### 3 WeatherReducer:
Implemented as `MaxTemperatureReducer` in the code below. It iterates through the temperature values for each key, determines the maximum temperature, and emits the result.
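
A running maximum needs a starting value lower than any temperature that can occur. Java's `Float.MIN_VALUE` is the smallest positive float rather than the most negative one, so it is an unsafe start when temperatures can fall below zero. The Python sketch below shows the same pitfall with `sys.float_info.min`, which is likewise a tiny positive number. The sample readings are hypothetical.

```python
import sys

def running_max(values, start):
    """Fold values into a maximum, beginning from the given start value."""
    best = start
    for v in values:
        best = max(best, v)
    return best

readings = [-12.5, -3.0, -7.8]   # hypothetical sub-zero temperatures

wrong = running_max(readings, sys.float_info.min)   # tiny positive start value
right = running_max(readings, float("-inf"))        # safe start value

print(wrong > 0)   # True: the start value wins over every reading
print(right)       # -3.0
```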

#### 4 Main Method:
Sets up and executes the MapReduce job on the Hadoop cluster, specifying input and output directories.

#### 5 Run the Program:
- Deploy the program on the multi-node Hadoop cluster.
- Execute the job using the Hadoop command line, specifying the input and output directories.

#### 6 View the Output:
The output will be stored in the specified output directory and can be viewed using Hadoop commands or downloaded for further analysis.
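
The mapper and reducer described in the steps above can be tried out locally before deploying to a cluster. The sketch below imitates the map, shuffle, and reduce phases in plain Python on a few comma-separated records. It is a single-process illustration, not Hadoop code, and the function names are chosen for this example.

```python
from collections import defaultdict

def map_record(line):
    """Emit ("date location", temperature) pairs; skip headers and bad rows."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 3:
        return
    try:
        temperature = float(fields[2])
    except ValueError:
        return   # header row or non-numeric temperature
    yield f"{fields[0]} {fields[1]}", temperature

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_max(key, values):
    """Return the key with its maximum temperature."""
    return key, max(values)

lines = [
    "Date,Location,Temperature",
    "2024-08-09,New_York,90.0",
    "2024-08-09,New_York,93.4",   # extra reading added for illustration
    "2024-08-09,Chicago,90.2",
    "2024-08-10,Chicago,92.3",
]

pairs = [pair for line in lines for pair in map_record(line)]
results = [reduce_max(k, v) for k, v in sorted(shuffle(pairs).items())]
for key, max_temp in results:
    print(f"{key}\t{max_temp}")
```

Sorting the grouped keys mirrors the sorted order in which a reducer receives them.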

## CODE

In [None]:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalysis {
    public static class WeatherMapper extends Mapper<Object, Text, Text, FloatWritable> {
        private Text dateLocationKey = new Text();
        private FloatWritable temperatureValue = new FloatWritable();

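        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Expect comma-separated records: date,location,temperature
            String[] fields = value.toString().split(",");
            if (fields.length < 3) {
                return; // skip blank or malformed lines
            }
            float temperature;
            try {
                temperature = Float.parseFloat(fields[2].trim());
            } catch (NumberFormatException e) {
                return; // skip the header row and non-numeric temperatures
            }
            dateLocationKey.set(fields[0].trim() + " " + fields[1].trim());
            temperatureValue.set(temperature);
            context.write(dateLocationKey, temperatureValue);
        }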
    }

    public static class MaxTemperatureReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        private FloatWritable result = new FloatWritable();

        public void reduce(Text key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException {
            // Float.MIN_VALUE is the smallest positive float, so start from negative infinity instead
            float maxTemp = Float.NEGATIVE_INFINITY;
            for (FloatWritable val : values) {
                maxTemp = Math.max(maxTemp, val.get());
            }
            result.set(maxTemp);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "weather analysis");
        job.setJarByClass(WeatherAnalysis.class);
        job.setMapperClass(WeatherMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
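
The job above reuses the reducer as a combiner, which is safe for a maximum because the maximum of partial maxima equals the overall maximum. The same shortcut does not hold for an average, one of the metrics named in the objective. The sketch below, in Python with hypothetical readings split across two map tasks, shows the difference and one common remedy: carrying (sum, count) pairs.

```python
# Hypothetical readings for one key, split across two map tasks.
task_a = [90.0, 96.0]
task_b = [87.0]

all_values = task_a + task_b

# Maximum: combining partial results gives the same answer.
assert max(max(task_a), max(task_b)) == max(all_values)

# Average: averaging the partial averages weights task_b's single reading too heavily.
naive = (sum(task_a) / len(task_a) + sum(task_b) / len(task_b)) / 2
true_mean = sum(all_values) / len(all_values)

def combine(values):
    """Combiner step: collapse readings into a (sum, count) pair."""
    return sum(values), len(values)

def reduce_mean(partials):
    """Reducer step: merge (sum, count) pairs, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

merged = reduce_mean([combine(task_a), combine(task_b)])
print(naive, true_mean, merged)
```

An averaging job would therefore need its own combiner and a value type that holds both the sum and the count.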


### Output


#### Input file (Weather_data.csv):

Date,Location,Temperature
2024-08-09,New_York,90.0
2024-08-09,Los_Angeles,87.0
2024-08-09,Chicago,90.2
2024-08-10,New_York,95.5
2024-08-10,Los_Angeles,87.5
2024-08-10,Chicago,92.3
2024-08-11,New_York,96.7
2024-08-11,Los_Angeles,85.0
2024-08-11,Chicago,89.1


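#### Output file (part-r-00000):

Each date and location pair occurs once in the sample input, so every maximum equals the recorded temperature. Keys are listed in the sorted order in which a single reducer emits them.

2024-08-09 Chicago     90.2
2024-08-09 Los_Angeles 87.0
2024-08-09 New_York    90.0
2024-08-10 Chicago     92.3
2024-08-10 Los_Angeles 87.5
2024-08-10 New_York    95.5
2024-08-11 Chicago     89.1
2024-08-11 Los_Angeles 85.0
2024-08-11 New_York    96.7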


### Aim:
To understand and use commands for viewing the content of files in Hadoop Distributed File System (HDFS).

### Description:
HDFS provides a set of commands to interact with the distributed file system, including commands to view the content of files. These commands allow users to access the data stored in HDFS without needing to download the entire file. Common commands include `hadoop fs -cat`, `hadoop fs -tail`, and `hadoop fs -head`, each serving different purposes for viewing the content of files.

### Common Commands and Formats:

1. **`hadoop fs -cat <file-path>`**
   - **Aim:** To display the entire content of a file in HDFS.
   - **Description:** This command prints the content of the specified file to the console. It is similar to the Unix `cat` command.
   - **Format:** 
     ```
     hadoop fs -cat <file-path>
     ```
   - **Example 1:** Display the content of a file named `example.txt` stored in the HDFS directory `/user/data/`.
     ```bash
     hadoop fs -cat /user/data/example.txt
     ```
   - **Example 2:** Display the content of a log file stored in the HDFS directory `/logs/app/`.
     ```bash
     hadoop fs -cat /logs/app/application.log
     ```

2. **`hadoop fs -tail <file-path>`**
   - **Aim:** To display the last 1 KB of the file content.
   - **Description:** This command shows the end of the file, which is particularly useful for checking the latest log entries in large log files.
   - **Format:** 
     ```
     hadoop fs -tail <file-path>
     ```
   - **Example 1:** Display the last part of a log file named `access.log` stored in the HDFS directory `/user/logs/`.
     ```bash
     hadoop fs -tail /user/logs/access.log
     ```
   - **Example 2:** Display the last part of a data file stored in the HDFS directory `/data/input/`.
     ```bash
     hadoop fs -tail /data/input/datafile.csv
     ```

### Additional Notes:
- **`hadoop fs -head <file-path>`**: This command displays the first 1 KB of the file content. It is useful for quickly checking the beginning of a file.

Using these commands, you can easily inspect files stored in HDFS, whether you need to see the entire file, the beginning, or just the latest entries.
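
These commands can also be invoked from a script. The sketch below builds the argument list for `hadoop fs -cat`, `-tail`, or `-head` and runs it with `subprocess`. The helper names are hypothetical, and running `hdfs_view` assumes the `hadoop` executable is on the `PATH` of the machine executing the script.

```python
import subprocess

VIEW_ACTIONS = ("cat", "tail", "head")   # the three viewing commands covered above

def hdfs_view_command(action, path):
    """Build the argv list for a file-viewing command such as `hadoop fs -cat <path>`."""
    if action not in VIEW_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["hadoop", "fs", f"-{action}", path]

def hdfs_view(action, path):
    """Run the command and return its standard output as text (requires Hadoop)."""
    completed = subprocess.run(
        hdfs_view_command(action, path),
        capture_output=True, text=True, check=True,
    )
    return completed.stdout

# Only the argument list is built here, so this line runs without a cluster.
print(hdfs_view_command("tail", "/user/logs/access.log"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.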

The following steps show how to build the WeatherAnalysis MapReduce program in Eclipse, package it as a JAR, and run it on a Hadoop cluster.

### Prerequisites:
- **Hadoop Installed**: Ensure Hadoop is installed and configured on your machine.
- **JDK Installed**: Ensure Java Development Kit (JDK) is installed.
- **Eclipse IDE**: Ensure Eclipse IDE is installed with necessary plugins for Java development.
- **Hadoop Libraries**: Ensure Hadoop libraries are downloaded and ready to be added to your project.

### Steps:

#### 1. **Set Up Eclipse for Your WeatherAnalysis Project:**
   - **Open Eclipse**: Launch Eclipse IDE.
   - **Create a New Java Project**:
     - Go to `File > New > Java Project`.
     - Name your project (e.g., `WeatherAnalysisProject`).
     - Click `Finish`.
   
   - **Add Hadoop Libraries to Your Project**:
     - Right-click on your project in the `Package Explorer` and select `Properties`.
     - Go to `Java Build Path > Libraries > Add External JARs`.
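     - Navigate to the directories that hold the Hadoop JARs (e.g., `/usr/local/hadoop/share/hadoop/common`, `/usr/local/hadoop/share/hadoop/common/lib`, and `/usr/local/hadoop/share/hadoop/mapreduce` in a typical Hadoop 2.x or 3.x layout) and select the necessary JAR files.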
     - Click `Apply and Close`.

#### 2. **Create a New Java Class for WeatherAnalysis:**
   - **Create a Package**:
     - Right-click on `src` and select `New > Package`.
     - Name the package (e.g., `com.hadoop.weather`).

   - **Create the Java Class**:
     - Right-click on the package and select `New > Class`.
     - Name the class (e.g., `WeatherAnalysis`).
     - Click `Finish`.

   - **Write the WeatherAnalysis Code**:
     - Write your Hadoop program to analyze weather data. For example:
       ```java
       package com.hadoop.weather;

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Job;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

       import java.io.IOException;

       public class WeatherAnalysis {

           public static class WeatherMapper
               extends Mapper<Object, Text, Text, Text> {

               public void map(Object key, Text value, Context context)
                       throws IOException, InterruptedException {
                   String[] fields = value.toString().split(",");
                   String year = fields[0].substring(0, 4);
                   String temp = fields[1];
                   context.write(new Text(year), new Text(temp));
               }
           }

           public static class WeatherReducer
               extends Reducer<Text, Text, Text, Text> {

               public void reduce(Text key, Iterable<Text> values, Context context)
                       throws IOException, InterruptedException {
                   int maxTemp = Integer.MIN_VALUE;
                   for (Text value : values) {
                       maxTemp = Math.max(maxTemp, Integer.parseInt(value.toString()));
                   }
                   context.write(key, new Text(String.valueOf(maxTemp)));
               }
           }

           public static void main(String[] args) throws Exception {
               Configuration conf = new Configuration();
               Job job = Job.getInstance(conf, "Weather Analysis");
               job.setJarByClass(WeatherAnalysis.class);
               job.setMapperClass(WeatherMapper.class);
               job.setReducerClass(WeatherReducer.class);
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(Text.class);
               FileInputFormat.addInputPath(job, new Path(args[0]));
               FileOutputFormat.setOutputPath(job, new Path(args[1]));
               System.exit(job.waitForCompletion(true) ? 0 : 1);
           }
       }
       ```
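     - This example finds the maximum temperature recorded each year. It assumes two-column input lines of the form `date,temperature` with integer temperatures (for example, `2024-08-09,90`). This differs from the three-column sample file shown earlier, so adjust the field indices and the parsing to match your data.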
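
The per-year logic of the Java example can be checked quickly in Python. The sketch below assumes two-column `date,temperature` lines with integer temperatures, the layout that the mapper's field indices imply. The sample lines and the function name are hypothetical.

```python
def max_temperature_by_year(lines):
    """Return {year: highest integer temperature} from date,temperature lines."""
    maxima = {}
    for line in lines:
        fields = line.split(",")
        if len(fields) < 2:
            continue                      # skip blank or malformed lines
        year = fields[0][:4]              # first four characters of the date
        try:
            temp = int(fields[1].strip())
        except ValueError:
            continue                      # e.g. a header row or a non-integer value
        maxima[year] = max(temp, maxima.get(year, temp))
    return maxima

# Hypothetical input lines for illustration.
sample = ["2023-07-01,88", "2023-07-02,91", "2024-01-15,42", "date,temperature"]
print(max_temperature_by_year(sample))
```

Unlike the Java version, this sketch skips lines it cannot parse instead of failing on them.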

#### 3. **Package the Program into a JAR File:**
   - **Export the Project as a JAR**:
     - Right-click on your project and select `Export`.
     - Choose `Java > JAR file` and click `Next`.
     - Select the resources to export, specify the export destination (e.g., `WeatherAnalysis.jar`), and click `Finish`.

   - **Specify the Main Class (Optional)**:
     - You can specify the main class by selecting `Runnable JAR file` in the export options or manually by editing the `MANIFEST.MF` file.

#### 4. **Copy the JAR File to the Hadoop Cluster:**
   - **Transfer the JAR File**:
     - Use `scp` or any file transfer method to copy the `WeatherAnalysis.jar` file to your Hadoop cluster.

#### 5. **Run the WeatherAnalysis Job:**
   - **Run the Job Using Hadoop Command Line**:
     - SSH into your Hadoop cluster.
     - Run the Hadoop job using the following command:
       ```bash
       hadoop jar /path/to/WeatherAnalysis.jar com.hadoop.weather.WeatherAnalysis /input/weatherdata /output/weatheranalysis
       ```
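     - Replace `/path/to/WeatherAnalysis.jar` with the actual path of the JAR file on the cluster, `/input/weatherdata` with the HDFS input directory, and `/output/weatheranalysis` with the HDFS output directory. The output directory must not already exist, or the job fails at submission. If the JAR's manifest already declares a main class, omit `com.hadoop.weather.WeatherAnalysis` from the command; otherwise Hadoop passes it to the program as the first argument.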

#### 6. **Check the Output:**
   - **View the Output Files**:
     - Once the job is complete, view the output files stored in the HDFS output directory using:
       ```bash
       hadoop fs -ls /output/weatheranalysis
       hadoop fs -cat /output/weatheranalysis/part-r-00000
       ```
     - This will display the maximum temperature recorded each year, as calculated by your WeatherAnalysis program.
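
For further analysis the output can be read back into a program. By default Hadoop's text output writes one `key<TAB>value` pair per line, and the sketch below parses such lines in Python. The sample text is hypothetical, and the function name is chosen for this example.

```python
def parse_output(text):
    """Turn tab-separated key/value lines into a {year: temperature} dict."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue                          # ignore blank lines
        key, _, value = line.partition("\t")  # split at the first tab only
        results[key] = int(value)
    return results

# Hypothetical content of part-r-00000, e.g. captured from `hadoop fs -cat`.
sample_output = "2023\t91\n2024\t42\n"
print(parse_output(sample_output))
```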

### Summary:
These steps outline the process of setting up, coding, packaging, deploying, and executing your WeatherAnalysis program using Hadoop and Eclipse. After running the job, you can inspect the results stored in HDFS to verify the success of your analysis.