# <h1 style="text-align:center;">Machine Learning</h1>


<h2> Installing Anaconda and Python : </h2>


1. **Download Anaconda**:
   - Open your web browser and go to the Anaconda download page.
   - Choose the appropriate version (Windows, Linux, or macOS).
   - Download the installer for the latest Python 3 version (Python 3.7 at the time of writing).


2. **Install Anaconda**:
   - Run the downloaded Anaconda installer (e.g., Anaconda3-2019.03-Windows-x86_64.exe).
   - Follow the installation wizard.
     - Agree to the license agreement.
     - Select the installation options (e.g., "Just me" for individual use).
     - Choose the installation location (you can leave it as the default).
     - Complete the installation.


3. **Open Anaconda Navigator**:
   - Use the Windows Start menu to search for "Anaconda Navigator."
   - Launch Anaconda Navigator.


4. **Launch Spyder IDE**:
   - In Anaconda Navigator, click on the "Launch" button next to Spyder.
   - This opens the Spyder IDE. If the button reads "Install" instead, click it first to install Spyder.


5. **Run Python Programs in Spyder**:
   - Open Spyder IDE, and it will provide a Python programming environment.
   - Write your Python programs in Spyder.
   - Save your program with a .py extension.
   - Run the program using the "Run" button.
   - Check the program's output in the console pane at the bottom right.


6. **Close Spyder IDE**:
   - When you're done, you can close the Spyder IDE.


This installation process provides you with Anaconda, which includes Python and various IDEs (including Spyder) for running and developing Python programs, making it a convenient solution for machine learning and data science.
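
Once the installation finishes, a short script run in Spyder can confirm that the interpreter and the usual data science packages are available. This is a minimal sketch: the package list is an assumption about what a typical Anaconda distribution bundles, and `check_packages` is a hypothetical helper name, not part of Anaconda.

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print the version of the interpreter that is running this script.
print("Python", sys.version.split()[0])

# Assumed package list; adjust it to the libraries you plan to use.
for name, available in check_packages(["numpy", "pandas", "matplotlib", "sklearn"]).items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If a package is reported as missing, it can usually be installed from the Environments tab in Anaconda Navigator.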

<h2> Artificial Intelligence Vs Machine Learning : </h2>


**Artificial Intelligence (AI):**
- AI is a technology that enables machines to simulate human behavior.
- The goal of AI is to create smart computer systems that can solve complex problems, similar to humans.
- AI aims to make intelligent systems that can perform various complex tasks.
- It has a wide scope and includes a broad range of applications, such as Siri, customer support chatbots, expert systems, and intelligent humanoid robots.
- AI can be categorized by capability into Weak (narrow) AI, General (strong) AI, and Super AI.
- AI involves learning, reasoning, and self-correction.
- AI deals with structured, semi-structured, and unstructured data.

**Machine Learning (ML):**
- ML is a subset of AI that allows machines to automatically learn from past data without explicit programming.
- The goal of ML is to enable machines to learn from data and provide accurate outputs.
- ML teaches machines to perform specific tasks for which they are trained.
- It has a limited scope compared to AI and mainly focuses on specific task-based applications.
- ML includes subcategories like supervised learning, unsupervised learning, and reinforcement learning.
- ML involves learning and self-correction when presented with new data.
- ML primarily deals with structured and semi-structured data.

In summary, AI is a broader concept focused on creating intelligent systems that can perform various complex tasks, while ML is a subset of AI that specifically deals with teaching machines to learn from data and perform specific tasks accurately.

<h2>How to get datasets for Machine Learning : </h2>


**What is a Dataset:**
- A dataset is a collection of data arranged in some order.
- It can be represented as a table with rows and columns, where each column corresponds to a variable, and each row represents data points.
- Common file formats for tabular datasets include CSV, while JSON is suitable for tree-like data.
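
To make the table-versus-tree distinction concrete, the sketch below parses one small dataset from CSV text and another from JSON text using only the standard library. The column names and values are invented for illustration.

```python
import csv
import io
import json

# A tabular dataset: each row is a data point, each column a variable.
csv_text = """country,age,salary
India,38,68000
France,43,45000
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["country"], rows[0]["age"])  # the csv module yields strings

# A tree-like dataset: values can nest inside one another.
json_text = """
{"customer": "C001",
 "orders": [{"item": "book", "quantity": 2},
            {"item": "pen", "quantity": 5}]}
"""
record = json.loads(json_text)
print(record["orders"][1]["item"], record["orders"][1]["quantity"])
```

The nested `orders` list has no natural place in a single flat table, which is why JSON suits this kind of data better than CSV.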

**Types of Data in Datasets:**
- Numerical data includes values like house prices or temperatures.
- Categorical data includes categories like Yes/No, True/False, or colors.
- Ordinal data is categorical data whose categories have a meaningful order, such as ratings of low, medium, and high.
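
These kinds of data map onto pandas dtypes, as sketched below. The column names and values are made up for illustration; the point is that an ordered categorical column supports comparisons while an unordered one does not.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [250000.0, 310000.0, 180000.0],   # numerical
    "has_garage": ["Yes", "No", "Yes"],        # categorical (no order)
    "condition": ["medium", "high", "low"],    # ordinal (ordered categories)
})

# Unordered categories: only equality checks are meaningful.
df["has_garage"] = pd.Categorical(df["has_garage"], categories=["No", "Yes"])

# Ordered categories: comparisons such as < and >= are meaningful.
df["condition"] = pd.Categorical(
    df["condition"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print(df["condition"] >= "medium")  # element-wise comparison against a category
```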

**Types of Datasets:**
1. **Image Datasets:**
   - Contain various images, used in computer vision tasks such as image classification and object detection.
   - Examples: ImageNet, CIFAR-10, MNIST.

2. **Text Datasets:**
   - Comprise textual information like articles or movie reviews, used in natural language processing (NLP) tasks.
   - Examples: Project Gutenberg texts, IMDb movie reviews dataset.

3. **Time Series Datasets:**
   - Involve data points collected over time, used in forecasting and trend analysis.
   - Examples: Stock market data, weather data, sensor readings.

4. **Tabular Datasets:**
   - Organized in tables, suitable for tasks like regression and classification.
   - Example: The provided sample dataset in the article.

**Need for Datasets:**
- Well-prepared datasets are crucial for machine learning projects.
- They serve as the foundation for training accurate and reliable models.
- However, working with large datasets can be challenging, requiring efficient data management techniques and algorithms.

<h3> Data Preprocessing </h3>


**Data Pre-processing:**
- Data pre-processing is a crucial stage in preparing datasets for machine learning.
- It involves transforming raw data into a suitable format for model training.
- Common pre-processing techniques include data cleaning, standardization, feature scaling, and handling missing values.

**Training Dataset and Test Dataset:**
- In machine learning, datasets are typically divided into two parts:
  1. **Training Dataset:** Used to train the machine learning model.
  2. **Test Dataset:** Used to evaluate the model's performance.
- The split makes it possible to measure how well the model generalizes to new, unseen data.
- Datasets should be representative of the problem and properly split to avoid bias or overfitting.
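
The split itself is simple: shuffle the row indices, then cut them at the chosen ratio. The sketch below does this by hand with NumPy to show the idea. `split_indices` is a hypothetical helper, and the 80/20 ratio and fixed seed are assumptions; in practice scikit-learn's `train_test_split` does the same job.

```python
import numpy as np


def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices and split them into train and test index arrays."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


# Ten samples with two features each, plus one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

train_idx, test_idx = split_indices(len(X))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Shuffling before cutting matters: if the file is sorted (for example by date or by class), taking the last rows as the test set would give an unrepresentative sample.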



<h3> Popular sources for Machine Learning datasets : </h3>

These are various sources for obtaining datasets for machine learning:

1. **Kaggle Datasets:**
   - Kaggle is a popular platform for data scientists and machine learning practitioners.
   - It offers a wide range of high-quality datasets in different formats.
   - You can find, download, and collaborate with others on data science-related projects.
   - [Kaggle Datasets](https://www.kaggle.com/datasets)

2. **UCI Machine Learning Repository:**
   - An important resource used by researchers and specialists since 1987.
   - Contains a vast collection of datasets categorized by machine learning tasks like regression, classification, and clustering.
   - Notable datasets include Iris, Car Evaluation, and Poker Hand.
   - [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)

3. **Datasets via AWS:**
   - Access datasets available through AWS resources, provided by various organizations and individuals.
   - These datasets can be accessed for analysis, reducing the time spent on data acquisition.
   - [Registry of Open Data on AWS](https://registry.opendata.aws/)

4. **Google's Dataset Search Engine:**
   - Google's Dataset Search helps researchers find and access datasets from various sources on the web.
   - It covers areas like social sciences, science, and environmental science.
   - Users can search for datasets using keywords and access them directly from the source.
   - [Google's Dataset Search Engine](https://toolbox.google.com/datasetsearch)

5. **Microsoft Datasets:**
   - Microsoft Research Open Data offers free datasets in areas like natural language processing, computer vision, and domain-specific sciences.
   - Access diverse and organized datasets that can be valuable for machine learning projects.
   - [Microsoft Research Open Data](https://msropendata.com/)

6. **Awesome Public Dataset Collection:**
   - Provides high-quality datasets organized by topics such as Agriculture, Biology, and Climate.
   - Most datasets are free, but it's essential to check the license before downloading.
   - [Awesome Public Dataset Collection](https://github.com/awesomedata/awesome-public-datasets)

7. **Government Datasets:**
   - Governments from various countries publish data collected from different departments for public use.
   - The goal is to increase transparency and encourage innovative use of government data.
   - Examples include:
     - [Indian Government dataset](https://data.gov.in/)
     - [US Government Dataset](https://www.data.gov/)
     - [European Union Open Data Portal](https://data.europa.eu/euodp/en/home)
     
8. **Computer Vision Datasets:**
   - Specifically for computer vision tasks like image classification, video classification, and image segmentation.
   - Ideal for projects in deep learning or image processing.
   - [Visual Data](https://www.visualdata.io/)

9. **Scikit-learn Dataset:**
   - Scikit-learn, a popular machine learning library in Python, offers several built-in datasets for practice and experimentation.
   - These datasets can be accessed through the Scikit-learn API and are useful for learning various machine learning algorithms.
   - Examples include the Iris, Wine, and Diabetes datasets. The Boston Housing dataset used in older tutorials was removed in scikit-learn 1.2.
   - [Scikit-learn Datasets](https://scikit-learn.org/stable/datasets/index.html)

These sources provide a wealth of datasets for various machine learning applications and research.
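
Of the sources above, the scikit-learn datasets are the quickest to try because they ship with the library. A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn (no download needed).
iris = load_iris()

X, y = iris.data, iris.target
print(X.shape)             # (150, 4): 150 samples, 4 features
print(iris.feature_names)  # names of the four measurements
print(iris.target_names)   # names of the three species
```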

<h1> Data Pre-processing : </h1>

Data preprocessing is a critical step in preparing data for machine learning models. It involves cleaning and formatting the data so that a model can use it effectively. The main steps are:

1. **Getting the Dataset:**
   - Collect or obtain the dataset that you'll use for your machine learning project. Datasets can be in various formats, such as CSV, HTML, or Excel.
   

2. **Importing Libraries:**
   - Import necessary libraries for data manipulation, analysis, and visualization.
   - Common libraries include NumPy, Matplotlib, and Pandas.


3. **Importing the Dataset:**
   - Load your dataset into your Python environment using functions like `pd.read_csv()` for CSV files.
   - Ensure that your Python script is in the same directory as your dataset or specify the file path correctly.


4. **Handling Missing Data:**
   - Identify and handle missing values in your dataset.
   - Common methods include removing rows or columns with missing data or imputing missing values using the mean, median, or mode of the respective feature.


5. **Encoding Categorical Data:**
   - Convert categorical variables (text data) into a numerical format since many machine learning algorithms require numeric input.
   - Techniques like Label Encoding (assigning unique numbers to categories) and One-Hot Encoding (creating binary columns for each category) can be used.


6. **Splitting the Dataset into Training and Test Sets:**
   - Divide your dataset into two subsets: a training set and a test set.
   - The training set is used to train your machine learning model, while the test set is used to evaluate the model's performance.
   - Typically, this is done using functions like `train_test_split` from libraries like scikit-learn.


7. **Feature Scaling:**
   - Scale your feature variables to ensure they are on a similar scale.
   - Common methods include Standardization (scaling features to have mean 0 and variance 1) and Normalization (scaling features to a specific range, often [0, 1]).
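
The two scaling methods in step 7 reduce to short formulas: standardization computes `(x - mean) / std` per feature, and min-max normalization computes `(x - min) / (max - min)`. The sketch below applies both with NumPy to a small made-up matrix; scikit-learn's `StandardScaler` and `MinMaxScaler` implement the same transformations.

```python
import numpy as np

# Made-up feature matrix: column 0 is age, column 1 is salary.
X = np.array([[25.0, 30000.0],
              [35.0, 60000.0],
              [45.0, 90000.0]])

# Standardization: subtract the column mean, divide by the column standard deviation.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: rescale each column to the range [0, 1].
normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardized.mean(axis=0))  # approximately 0 for each column
print(normalized)
```

Without scaling, the salary column would dominate any distance-based calculation simply because its values are thousands of times larger than the ages.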


These are the fundamental steps of data preprocessing. Depending on the dataset and the model, additional steps such as feature engineering or dimensionality reduction may be needed. The quality of the data directly affects the model's performance, so this stage deserves care.
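
Steps 3 to 5 can also be done with pandas alone, which is handy for quick exploration. The sketch below reads a small inline CSV (invented values, standing in for a real file), fills missing numeric values with the column mean, and one-hot encodes the categorical column. The column names are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Inline CSV standing in for a file on disk; the values are invented.
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,,Yes
Germany,,54000,No
France,35,58000,Yes
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values per column.
print(df.isna().sum())

# Impute missing numeric values with the mean of each column.
numeric_cols = ["Age", "Salary"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# One-hot encode Country: one binary column per category.
df = pd.get_dummies(df, columns=["Country"])

# Label encode the target: No -> 0, Yes -> 1.
df["Purchased"] = df["Purchased"].map({"No": 0, "Yes": 1})

print(df)
```

One-hot encoding is used for `Country` because the countries have no order; assigning them the numbers 0, 1, 2 would suggest a ranking that does not exist.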

In [None]:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('Dataset.csv')

# Extracting the independent variables (every column except the last)
x = data_set.iloc[:, :-1].values

# Extracting the dependent variable (the last column, index 3 in this dataset)
y = data_set.iloc[:, -1].values

# Handling missing data (replacing missing values with the column mean)
# SimpleImputer replaces the Imputer class that was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting the imputer object to the numeric columns of x
imputer = imputer.fit(x[:, 1:3])

# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])

# Encoding the Country variable (column 0) as dummy variables
# ColumnTransformer replaces the removed categorical_features argument of OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
column_transformer = ColumnTransformer(
    [('country', OneHotEncoder(), [0])],
    remainder='passthrough',  # keep the numeric columns unchanged
    sparse_threshold=0)       # always return a dense array
x = np.array(column_transformer.fit_transform(x), dtype=float)

# Encoding the Purchased variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Feature scaling: fit on the training set only, then apply to both sets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)