# Strategy to group Engineering Colleges

You are an independent trainer who wants to pitch a Data Science training program to a set of engineering colleges. You have survey data, collected through questionnaires, for 26 colleges. Each college has been given a score on five performance criteria: Teaching, Fees, Placements, Internship and Infrastructure. Ratings are on a standardized scale of 1 to 5, where 5 is the highest rating. Segment the colleges into groups and come up with pitch recommendations for each segment.

# Import Libraries and load data

In [5]:
import numpy as np
import pandas as pd

In [6]:
df=pd.read_csv('Engg_College_Data.csv') 

# Checking the data

In [7]:
df.head()

Unnamed: 0,SR_NO,Engg_College,Teaching,Fees,Placements,Internship,Infrastructure
0,1,A,5,2,5,5,3
1,2,B,4,2,5,5,3
2,3,C,4,5,5,4,5
3,4,D,5,4,5,4,4
4,5,E,2,5,2,2,5


In [8]:
df.shape

(26, 7)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   SR_NO           26 non-null     int64 
 1   Engg_College    26 non-null     object
 2   Teaching        26 non-null     int64 
 3   Fees            26 non-null     int64 
 4   Placements      26 non-null     int64 
 5   Internship      26 non-null     int64 
 6   Infrastructure  26 non-null     int64 
dtypes: int64(6), object(1)
memory usage: 1.5+ KB


# Get the count of unique values for Engg_College

In [10]:
df.Engg_College.nunique()

26

# Get the unique values for Engg_College

In [11]:
df.Engg_College.unique()

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype=object)

# Checking the Summary Statistics

In [12]:
df.describe() 

Unnamed: 0,SR_NO,Teaching,Fees,Placements,Internship,Infrastructure
count,26.0,26.0,26.0,26.0,26.0,26.0
mean,13.5,2.807692,3.576923,2.884615,2.769231,3.384615
std,7.648529,1.132051,1.331858,1.531716,1.242826,1.29852
min,1.0,1.0,1.0,1.0,1.0,1.0
25%,7.25,2.0,2.25,1.25,2.0,3.0
50%,13.5,3.0,4.0,3.0,2.0,3.0
75%,19.75,3.0,5.0,4.0,4.0,4.75
max,26.0,5.0,5.0,5.0,5.0,5.0


# Removing the SR_NO column and storing the result in new object enggdata

In [13]:
enggdata = df.drop(columns='SR_NO')  # drop by name rather than by column position

In [14]:
enggdata.head()

Unnamed: 0,Engg_College,Teaching,Fees,Placements,Internship,Infrastructure
0,A,5,2,5,5,3
1,B,4,2,5,5,3
2,C,4,5,5,4,5
3,D,5,4,5,4,4
4,E,2,5,2,2,5


# Creating the Dendrogram
### Importing dendrogram and linkage module

In [15]:
from scipy.cluster.hierarchy import dendrogram, linkage

### Choosing average linkage method

In [16]:
link_method = linkage(enggdata.iloc[:,1:6], method = 'average')
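The notebook settles on average linkage without showing how it compares to the alternatives. One common check is the cophenetic correlation coefficient, which measures how well the tree preserves the original pairwise distances. The sketch below is only an illustration on a small toy array (the values and the names `toy`, `pairwise` and `cophenetic_scores` are assumptions, not part of the survey analysis); the same loop could be pointed at `enggdata.iloc[:,1:6]`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Toy ratings for illustration (assumed values): 6 colleges x 5 criteria
toy = np.array([
    [5, 2, 5, 5, 3],
    [4, 2, 5, 5, 3],
    [4, 5, 5, 4, 5],
    [2, 5, 2, 2, 5],
    [1, 1, 1, 2, 5],
    [2, 3, 1, 2, 2],
])

# Condensed matrix of pairwise Euclidean distances between colleges
pairwise = pdist(toy)

# Cophenetic correlation for each linkage method (closer to 1 is better)
cophenetic_scores = {}
for method in ("single", "complete", "average", "ward"):
    Z = linkage(toy, method=method)
    coef, _ = cophenet(Z, pairwise)
    cophenetic_scores[method] = coef

for method, coef in cophenetic_scores.items():
    print(f"{method:>8}: {coef:.3f}")
```

A higher coefficient suggests the dendrogram distorts the original distances less, but it is only one criterion; interpretability of the resulting clusters matters as well.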

In [17]:
labelList = np.array(enggdata.Engg_College)
labelList

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype=object)

In [21]:
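# In some SciPy versions, passing a NumPy array as `labels` raises
# "ValueError: The truth value of an array with more than one element is ambiguous".
# Converting the labels to a plain Python list avoids the error.
dend = dendrogram(link_method, labels=list(labelList))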

### Cutting the Dendrogram with suitable clusters

In [22]:
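# Show only the last 10 merged clusters. The labels are passed as a plain list
# because a NumPy array can trigger the "truth value of an array ... is ambiguous"
# ValueError in some SciPy versions.
dend = dendrogram(link_method,
                 truncate_mode='lastp',
                 p = 10,
                 labels = list(labelList)
                 )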
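In some SciPy versions, `dendrogram` raises the ambiguous-truth-value `ValueError` when `labels` is a NumPy array, and a plain Python list avoids it. The sketch below checks this on a small toy array and also shows what `truncate_mode='lastp'` returns. The data and the names `toy`, `toy_labels`, `full` and `truncated` are assumptions for illustration; `no_plot=True` is used so the layout can be inspected without drawing a figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy ratings for illustration (assumed values): 5 colleges x 5 criteria
toy = np.array([
    [5, 2, 5, 5, 3],
    [4, 2, 5, 5, 3],
    [2, 5, 2, 2, 5],
    [1, 1, 1, 2, 5],
    [2, 3, 1, 2, 2],
])
toy_labels = np.array(["A", "B", "C", "D", "E"])

Z = linkage(toy, method="average")

# Convert the label array to a list before passing it as `labels`.
# no_plot=True returns the computed layout without drawing anything.
full = dendrogram(Z, labels=toy_labels.tolist(), no_plot=True)
print(full["ivl"])  # leaf labels in left-to-right order

# Keep only the last p=3 merged clusters; collapsed branches are shown
# as a member count in parentheses instead of a college label.
truncated = dendrogram(Z, truncate_mode="lastp", p=3,
                       labels=toy_labels.tolist(), no_plot=True)
print(truncated["ivl"])
```

The `ivl` entry of the returned dictionary holds the leaf labels, which is a quick way to confirm the labels were accepted before rendering the plot.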

# Importing fcluster module to create clusters

In [23]:
from scipy.cluster.hierarchy import fcluster

In [24]:
# Set criterion as maxclust,then create 3 clusters, and store the result in another object 'clusters'

clusters = fcluster(link_method, 3, criterion='maxclust')
clusters

array([1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 3, 3, 3, 1, 1, 1, 3, 3,
       3, 2, 2, 1], dtype=int32)
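To make the `maxclust` criterion concrete, here is a minimal sketch on toy data with three obvious groups. The array `toy` and the threshold `2.0` are assumptions chosen for illustration; the second call shows the alternative `distance` criterion, which cuts the tree at a height instead of asking for a number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy ratings for illustration (assumed values): three well-separated pairs
toy = np.array([
    [5, 5, 5, 5, 5],
    [5, 4, 5, 5, 4],
    [3, 3, 3, 3, 3],
    [3, 3, 2, 3, 3],
    [1, 1, 1, 1, 1],
    [1, 2, 1, 1, 1],
])

Z = linkage(toy, method="average")

# Ask for at most 3 flat clusters
toy_clusters = fcluster(Z, t=3, criterion="maxclust")
print(toy_clusters)

# Alternative: cut the tree wherever the merge distance exceeds 2.0
toy_clusters_dist = fcluster(Z, t=2.0, criterion="distance")
print(toy_clusters_dist)
```

With `maxclust`, the cluster numbers themselves carry no ranking; they only identify which rows belong together, so the labels have to be interpreted from the cluster profiles.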

### Appending clusters to original dataset

In [25]:
df['clusters'] = clusters

In [26]:
df.head()

Unnamed: 0,SR_NO,Engg_College,Teaching,Fees,Placements,Internship,Infrastructure,clusters
0,1,A,5,2,5,5,3,1
1,2,B,4,2,5,5,3,1
2,3,C,4,5,5,4,5,1
3,4,D,5,4,5,4,4,1
4,5,E,2,5,2,2,5,1


### Cluster Frequency

In [27]:
df.clusters.value_counts().sort_index()

1    16
2     2
3     8
Name: clusters, dtype: int64

# Cluster Profiles

In [28]:
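# Average only the five rating columns per cluster. The non-numeric Engg_College
# column is dropped first because mean() fails on it in recent pandas versions.
aggdata=df.drop(columns=['SR_NO','Engg_College']).groupby('clusters').mean()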
aggdata['Freq']=df.clusters.value_counts().sort_index()
aggdata

clusters,Teaching,Fees,Placements,Internship,Infrastructure,Freq
1,3.375,4.125,3.875,3.4375,3.875,16
2,1.0,1.0,1.0,1.5,5.0,2
3,2.125,3.125,1.375,1.75,2.0,8


- Cluster 1: Tier 1 colleges (Top Colleges)
- Cluster 2: Tier 3 colleges (Poor performing colleges/new college)
- Cluster 3: Tier 2 colleges (Medium performing colleges)
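The tier labels can be cross-checked against the cluster means. The sketch below copies the means from the profile table and ranks the clusters by their average rating. It assumes that every criterion counts equally and that a higher rating is always better (which may not hold for Fees); the names `profiles`, `overall`, `tier` and `summary` are introduced only for this illustration.

```python
import pandas as pd

# Cluster means copied from the cluster profile table
profiles = pd.DataFrame(
    {
        "Teaching": [3.375, 1.0, 2.125],
        "Fees": [4.125, 1.0, 3.125],
        "Placements": [3.875, 1.0, 1.375],
        "Internship": [3.4375, 1.5, 1.75],
        "Infrastructure": [3.875, 5.0, 2.0],
    },
    index=pd.Index([1, 2, 3], name="clusters"),
)

# Assumption: equal weight for every criterion, higher rating is better
overall = profiles.mean(axis=1)

# Rank 1 goes to the cluster with the highest overall mean
tier = overall.rank(ascending=False).astype(int)

summary = pd.DataFrame({"overall": overall, "tier": tier})
print(summary.sort_values("tier"))
```

Under these assumptions the ordering agrees with the labels above: cluster 1 ranks first, cluster 3 second and cluster 2 third, although clusters 2 and 3 are close in overall mean and differ mainly in Infrastructure and Fees.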

# Recommendations
1. Companies hiring for placements should go to Tier 1 colleges first, followed by Tier 2 colleges.
2. Companies providing training programs for staff and students should target Tier 2 and Tier 3 colleges, since Tier 1 colleges are already performing comparatively better.
3. Tier 3 colleges need to concentrate more on marketing and advertising their campuses to create awareness and attract students.
4. Students looking to enroll in a college can give priority to Tier 1 colleges over Tier 2 and Tier 3 colleges.

# Saving the Cluster Profiles in a csv file

In [22]:
#aggdata.to_csv('enggdata_hc.csv')