
# GRIP: The Sparks Foundation
### Data Science and Business Analytics Intern
### Prediction using Supervised ML

In this task we predict a student's percentage score from the number of hours studied. The data has two variables: the feature is the number of hours studied and the target is the percentage score. Because there is a single feature, this can be solved with simple linear regression.
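
For intuition, simple linear regression fits a line `score ≈ intercept + slope * hours` by minimising the squared error. The sketch below computes the closed-form least-squares estimates with NumPy. The numbers are made up for illustration (they are not the task's dataset), and the names `hours` and `scores` are hypothetical.

```python
import numpy as np

# Hypothetical toy data (not the task's dataset): scores lie exactly on 10*hours + 5
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = 10.0 * hours + 5.0

# Closed-form least-squares estimates for a single feature:
# slope = cov(hours, scores) / var(hours), intercept = mean(scores) - slope * mean(hours)
slope = np.cov(hours, scores, bias=True)[0, 1] / np.var(hours)
intercept = scores.mean() - slope * hours.mean()

print("slope:", slope, "intercept:", intercept)
```

scikit-learn's `LinearRegression`, used later in this notebook, solves the same least-squares problem.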

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
url = "http://bit.ly/w-data"
data = pd.read_csv(url)
data.head()

In [None]:
print(data.shape)

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
data.plot(x='Hours', y='Scores', style='*')  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score')  
plt.show()

From the graph above, there is a clear positive linear relationship between the number of hours studied and the percentage score.

In [None]:
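# The next step is to divide the data into "attributes" (inputs) and "labels" (outputs).
X = data.iloc[:, :-1].values  # every column except the last (Hours), kept 2-D for scikit-learn
y = data.iloc[:, 1].values    # second column (Scores), the target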


Now that we have our attributes and labels, the next step is to split the data into training and test sets. We'll do this with scikit-learn's built-in train_test_split() function:



In [None]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                            test_size=0.2, random_state=0) 
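
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a hypothetical 10-row array (the `_demo` names are illustrative and not part of the task's data). With `test_size=0.2`, two rows go to the test set and eight to the training set, and reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with a single feature
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# The same random_state gives an identical split on a second call
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))
```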

In [None]:
#Training the Algorithm
#We have split our data into training and testing sets, and now is finally the time to train our algorithm.


from sklearn.linear_model import LinearRegression  
regressor = LinearRegression()  
regressor.fit(X_train, y_train) 

print("Training complete.")

In [None]:
# Computing the fitted regression line
line = regressor.coef_ * X + regressor.intercept_

# Plotting the full data set together with the fitted line
plt.scatter(X, y)
plt.plot(X, line)
plt.show()

# Making Predictions
### Now that we have trained our algorithm, it's time to make some predictions.

In [None]:
print(X_test) # Testing data - In Hours
y_pred = regressor.predict(X_test) # Predicting the scores
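
A common way to inspect predictions is to place them next to the actual values. The sketch below does this on made-up data, with hypothetical `_demo` names, and also shows that `predict` expects a 2-D array even for a single new value. It illustrates the pattern only and does not report the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: scores lie exactly on 10*hours + 2
hours_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores_demo = 10.0 * hours_demo.ravel() + 2.0

# Fit on the first four rows and hold out the last one
model_demo = LinearRegression().fit(hours_demo[:4], scores_demo[:4])
predicted_demo = model_demo.predict(hours_demo[4:])

# Side-by-side comparison of actual and predicted values
comparison = pd.DataFrame({'Actual': scores_demo[4:], 'Predicted': predicted_demo})
print(comparison)

# Predicting for one new value: the input must be 2-D, shape (n_samples, n_features)
new_hours = np.array([[6.0]])
print(model_demo.predict(new_hours))
```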

### Evaluating the model
The final step is to evaluate the performance of the algorithm. This step is particularly important for comparing how well different algorithms perform on a particular dataset. For simplicity, we have chosen the mean absolute error here; many other metrics exist.

In [None]:
from sklearn import metrics  
print('Mean Absolute Error:', 
      metrics.mean_absolute_error(y_test, y_pred)) 
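
Other common regression metrics can be computed from the same prediction errors. The sketch below works them out with NumPy on made-up values (`y_true` and `y_hat` are hypothetical and not the notebook's results). The corresponding scikit-learn functions are `mean_absolute_error`, `mean_squared_error` and `r2_score` in `sklearn.metrics`.

```python
import numpy as np

# Hypothetical actual and predicted scores
y_true = np.array([20.0, 30.0, 50.0, 70.0])
y_hat = np.array([22.0, 28.0, 55.0, 65.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the units of the target
# R^2: share of the target's variance explained by the predictions
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```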

# Prediction using Unsupervised ML

### From the given ‘Iris’ dataset, predict the optimum number of clusters and represent it visually.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

In [None]:
# Load the iris dataset
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df.head() # See the first 5 rows

How do you find the optimum number of clusters for K Means? How does one determine the value of K?

In [None]:
# Finding the optimum number of clusters for k-means clustering

x = iris_df.iloc[:, [0, 1, 2, 3]].values

from sklearn.cluster import KMeans
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', 
                    max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
    
# Plotting the results into a line graph, 
# allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # Within cluster sum of squares
plt.show()

The graph shows why this is called the elbow method: the optimum number of clusters is where the curve bends, that is, where adding another cluster no longer reduces the within-cluster sum of squares (WCSS) significantly.

From this we choose the number of clusters as **3**.
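
Reading the elbow off a plot is a judgment call. One way to support it numerically, sketched below as an illustration rather than a definitive rule, is to look at how much each extra cluster reduces the WCSS relative to the previous value. The names `features`, `wcss_demo` and `relative_drop` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

features = datasets.load_iris().data

# WCSS (inertia) for k = 1..10, using the same KMeans settings as above
wcss_demo = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init='k-means++',
                   max_iter=300, n_init=10, random_state=0)
    model.fit(features)
    wcss_demo.append(model.inertia_)
wcss_demo = np.array(wcss_demo)

# Fraction of the previous WCSS removed by going from k-1 to k clusters
relative_drop = (wcss_demo[:-1] - wcss_demo[1:]) / wcss_demo[:-1]

for k, drop in zip(range(2, 11), relative_drop):
    print(f"k={k}: WCSS falls by {drop:.1%} relative to k={k - 1}")
```

The elbow is roughly where these relative drops level off.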

In [None]:
# Applying k-means to the dataset / creating the k-means clustering model
kmeans = KMeans(n_clusters = 3, init = 'k-means++',
                max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)

In [None]:
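# Visualising the clusters on the first two columns (sepal length and sepal width)
# Note: cluster indices are arbitrary, so the species names in the legend are an assumed mapping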
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], 
            s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], 
            s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1],
            s = 100, c = 'green', label = 'Iris-virginica')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], 
            s = 100, c = 'yellow', label = 'Centroids')

plt.legend()
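
K-means numbers its clusters arbitrarily, so a cluster index does not by itself identify a species. The sketch below cross-tabulates the cluster labels against the known species in scikit-learn's copy of Iris to check which cluster corresponds to which species. The names `iris_demo`, `labels_demo` and `table` are hypothetical.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris_demo = datasets.load_iris()
labels_demo = KMeans(n_clusters=3, init='k-means++', max_iter=300,
                     n_init=10, random_state=0).fit_predict(iris_demo.data)

# Map the integer targets to species names and compare them with the cluster labels
species = pd.Series(iris_demo.target_names[iris_demo.target], name='species')
clusters = pd.Series(labels_demo, name='cluster')

# Rows are cluster indices, columns are species, cells are sample counts
table = pd.crosstab(clusters, species)
print(table)
```

Each row's largest count indicates which species that cluster mostly captures, and this can be used to set the legend labels.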

# Exploratory Data Analysis - Retail
(Level - Beginner)


# Task3: ‘Exploratory Data Analysis’ on dataset ‘SampleSuperstore’ 

Here I perform EDA on the given dataset. From a business manager's point of view, the aim is to find the weak areas where we can work to increase profit, and to identify the business problems that exploring the data reveals.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt


# Read the given dataset

In [None]:
sample_data = pd.read_csv("/content/SampleSuperstore.csv")

In [None]:
sample_data.head()

In [None]:
sample_data.shape

In [None]:
sample_data.describe()

In [None]:
sample_data.info()

In [None]:
sample_data['Category'].value_counts()

In [None]:
sample_data.duplicated().sum()

In [None]:
sample_data = sample_data.drop_duplicates()  # keep the de-duplicated frame
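
Note that `drop_duplicates()` returns a new DataFrame and leaves the original unchanged unless the result is assigned back. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row
df_demo = pd.DataFrame({'State': ['Texas', 'Texas', 'Ohio'],
                        'Sales': [10.0, 10.0, 5.0]})

n_duplicates = int(df_demo.duplicated().sum())
deduped = df_demo.drop_duplicates()

# The original still has all rows; only the returned frame is de-duplicated
print(n_duplicates, len(df_demo), len(deduped))
```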

In [None]:
sample_data.nunique()  # Display the number of unique values per column

Drop the irrelevant column

In [None]:
col = ['Postal Code']
sample_drop = sample_data.drop(columns=col)  # columns= already implies axis=1

In [None]:
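# Correlation and covariance are only defined for the numeric columns
print(sample_drop.corr(numeric_only=True))
print(sample_drop.cov(numeric_only=True))
sample_drop.head()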

# Data Visualization

In [None]:
plt.figure(figsize=(16,8))
plt.bar("Sub-Category","Category",data=sample_drop, color="green")
plt.title("Category vs Sub-Category",fontsize=20)
plt.xlabel('Sub-Category',fontsize=15)
plt.ylabel('Category',fontsize=15)
plt.xticks(rotation=45)
plt.show()

In [None]:
sample_drop.hist(bins=50,figsize=(20,15))
plt.show()

In [None]:
sample_drop['State'].value_counts()  # Number of rows per state

In [None]:
plt.figure(figsize=(16,16))
sns.countplot(x=sample_drop['State'])
plt.title("STATE")
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sample_drop['Sub-Category'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

In [None]:
sample_drop.groupby('Sub-Category')[['Profit','Sales']].agg(['sum']).plot.bar()
plt.title('TOTAL PROFIT AND SALES PER SUB-CATEGORY')
plt.show()

sns.set(style="whitegrid")
plt.figure(2,figsize=(16,8))

sns.barplot(x='Sub-Category',y='Profit',data=sample_data,palette='Spectral')
plt.suptitle("PROFIT PER SUB-CATEGORY",fontsize=20)
plt.show()

In [None]:
sns.countplot(x=sample_data['Ship Mode'])
plt.show()

In [None]:
# Plotting a pair plot coloured by Sub-Category
sns.pairplot(sample_drop,hue="Sub-Category")
plt.show()

In [None]:
plt.figure(figsize=(10,4))
sns.lineplot(x='Discount',y='Profit',data=sample_drop,color='y', label="Discount")
plt.legend()
plt.show()

In [None]:
def state_data_viewer(states):
  """Plot profit per sub-category, one panel per category, for each state."""
  product_data = sample_drop.groupby('State')
  for state in states:
    data = product_data.get_group(state).groupby('Category')
    fig, ax = plt.subplots(1, 3, figsize=(28, 5))
    fig.suptitle(state, fontsize=14)
    # One subplot per category (three categories, three axes)
    for ax_index, cat in enumerate(['Furniture', 'Office Supplies', 'Technology']):
      cat_data = data.get_group(cat).groupby('Sub-Category')[['Profit']].sum()
      sns.barplot(x=cat_data.Profit, y=cat_data.index, ax=ax[ax_index])
      ax[ax_index].set_ylabel(cat)
    plt.show()
states = ['California', 'Washington', 'Mississippi', 'Arizona', 'Texas']
state_data_viewer(states)

# Conclusion
1. From the histograms for profit, discount, quantity and sales, we can say that the data is not normally distributed.
2. The pair plots by Sub-Category show that there are outliers.
3. From the visualizations above, we can see the states and categories where sales and profit are high or low. We can improve in the weaker states by offering discounts in the preferred range so that both the company and the consumer benefit.
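
The first two conclusions are read off plots. They can also be checked numerically, for example with the sample skewness and the 1.5 × IQR rule for outliers. The sketch below uses synthetic right-skewed data as a hypothetical stand-in for a column such as Sales, so it shows the method only and says nothing about the Superstore figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed values standing in for a column such as Sales
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000), name='value')

# Skewness is close to 0 for symmetric data; large positive values mean a long right tail
skewness = values.skew()

# 1.5 * IQR rule: flag points far outside the interquartile range
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("skewness:", skewness)
print("number of flagged outliers:", len(outliers))
```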