# Case Study based on Statistics

Meet Somali, a public school administrator. Some schools in her state of Maharashtra are performing below average academically. Her superintendent, under pressure from frustrated parents and voters, has asked Somali to find out why these schools are under-performing. Not an easy problem, to be sure.

To improve school performance, Somali needs to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers.

Though Somali is eager to build an impressive explanatory model, she knows the importance of conducting preliminary research to avoid possible pitfalls and blind spots (e.g. cognitive biases). She therefore carries out a thorough exploratory analysis, which includes a literature review, data collection, descriptive and inferential statistics, and data visualization.

    • Derive business insights by analysing the data across different KPIs, and create a report based on them.
    • What is the inside story of this business that would showcase value to Somali?
    • What alternative solutions would you propose to help Somali?

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('student_data.txt', sep=",")
df

Unnamed: 0,name,school_rating,size,reduced_lunch,state_percentile_16,state_percentile_15,stu_teach_ratio,school_type,avg_score_15,avg_score_16,full_time_teachers,percent_black,percent_white,percent_asian,percent_hispanic
0,Allendale Elementary School,5.0,851.0,10.0,90.2,95.8,15.7,Public,89.4,85.2,54.0,2.9,85.5,1.6,5.6
1,Anderson Elementary,2.0,412.0,71.0,32.8,37.3,12.8,Public,43.0,38.3,32.0,3.9,86.7,1.0,4.9
2,Avoca Elementary,4.0,482.0,43.0,78.4,83.6,16.6,Public,75.7,73.0,29.0,1.0,91.5,1.2,4.4
3,Bailey Middle,0.0,394.0,91.0,1.6,1.0,13.1,Public Magnet,2.1,4.4,30.0,80.7,11.7,2.3,4.3
4,Barfield Elementary,4.0,948.0,26.0,85.3,89.2,14.8,Public,81.3,79.6,64.0,11.8,71.2,7.1,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Winfree Bryant Middle School,3.0,611.0,57.0,59.1,65.2,16.9,Public,61.4,57.7,36.0,15.2,66.3,1.5,15.7
343,Winstead Elementary School,5.0,515.0,8.0,93.9,97.0,14.3,Public,92.0,89.3,36.0,3.3,87.4,3.1,4.1
344,Woodland Elementary,4.0,424.0,55.0,84.8,76.7,14.1,Public,69.4,79.4,30.0,11.6,70.5,2.1,9.7
345,Woodland Middle School,5.0,866.0,2.0,93.3,97.1,19.2,Public,89.8,84.9,45.0,4.5,77.6,10.0,4.4


## The above dataset describes different schools in Maharashtra. Excluding the index column `Unnamed: 0`, there are 15 columns in total.
1. name - Representing the school name.
2. school_rating - Rating of the school.
3. size - Number of students enrolled in the school.
4. reduced_lunch - Percentage of students receiving reduced-price lunch.
5. state_percentile_16 - State percentile ranking of the school in 2016.
6. state_percentile_15 - State percentile ranking of the school in 2015.
7. stu_teach_ratio - Student-teacher ratio.
8. school_type - Type of school (e.g. Public, Public Magnet).
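9. avg_score_15 - Average score in 2015.
10. avg_score_16 - Average score in 2016.
11. full_time_teachers - Number of full-time teachers.
12. percent_black - Percentage of Black students.
13. percent_white - Percentage of White students.
14. percent_asian - Percentage of Asian students.
15. percent_hispanic - Percentage of Hispanic students.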
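
The column list above can be cross-checked against the data itself. The following is a minimal sketch, not part of the original notebook: it rebuilds a three-row sample from rows shown in the output above so that it runs without `student_data.txt`, and it assumes the `Unnamed: 0` column is simply a saved row index with an empty header in the file. The names `csv_text`, `sample` and `summary` are illustrative.

```python
import io
import pandas as pd

# Three rows copied from the output above, so the sketch runs without student_data.txt.
# Assumption: the first header field is empty in the file, which is how pandas
# ends up naming the column "Unnamed: 0".
csv_text = """,name,school_rating,size,reduced_lunch,state_percentile_16,state_percentile_15,stu_teach_ratio,school_type,avg_score_15,avg_score_16,full_time_teachers,percent_black,percent_white,percent_asian,percent_hispanic
0,Allendale Elementary School,5.0,851.0,10.0,90.2,95.8,15.7,Public,89.4,85.2,54.0,2.9,85.5,1.6,5.6
1,Anderson Elementary,2.0,412.0,71.0,32.8,37.3,12.8,Public,43.0,38.3,32.0,3.9,86.7,1.0,4.9
3,Bailey Middle,0.0,394.0,91.0,1.6,1.0,13.1,Public Magnet,2.1,4.4,30.0,80.7,11.7,2.3,4.3
"""

# index_col=0 uses the first column as the row index instead of a data column
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# One row per column: its dtype and the number of missing values
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_missing": sample.isna().sum(),
})
print(summary)
print(sample.shape)
```

On the full file, the same `summary` table would show at a glance which columns are numeric and whether any of them contain missing values that need handling before further analysis.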

## First, we will check whether any duplicate schools are present in the dataframe.


In [11]:
df['name'].value_counts().head(10) 

Liberty Elementary                  3
Rock Springs Elementary             2
Johnson Elementary                  2
South Side Elementary               2
Eakin Elementary                    2
Crockett Elementary                 1
Glenellen Elementary                1
Station Camp Elementary             1
Thomas Magnet                       1
White House Heritage High School    1
Name: name, dtype: int64

### From the above output we can see that 5 school names are repeated:
    Liberty Elementary                  3
    Rock Springs Elementary             2
    Johnson Elementary                  2
    South Side Elementary               2
    Eakin Elementary                    2
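
Before dropping anything, it can help to look at the repeated rows themselves, since two rows with the same name may be the same school entered twice or two different schools that happen to share a name. The sketch below is an illustration on a toy dataframe with made-up names and values (`toy`, `dupe_mask` and `dupe_rows` are hypothetical names, not from the notebook); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame with made-up names and values (NOT the real school data)
toy = pd.DataFrame({
    "name": ["School A", "School B", "School A", "School C", "School B", "School A"],
    "size": [400.0, 500.0, 420.0, 610.0, 480.0, 390.0],
})

# keep=False marks every occurrence of a repeated name, not only the later ones
dupe_mask = toy["name"].duplicated(keep=False)

# Show the repeated rows side by side, grouped by name
dupe_rows = toy[dupe_mask].sort_values("name")
print(dupe_rows)

# Number of rows per repeated name
print(dupe_rows["name"].value_counts())
```

If the repeated rows differ substantially in columns such as size or school type, they are probably distinct schools and should be kept as separate records.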
    
### We will create a new dataframe from which all the duplicates, i.e. rows whose school name has a count greater than 1, are removed.

In [18]:
def removing_duplicates(column_name, df):
    # Guard against a column name that is not present in the dataframe
    if column_name not in df.columns:
        raise KeyError('Column name entered did not match any of the columns in dataframe')
    # Values of the column that occur more than once
    v_count = df[column_name].value_counts()
    tes = v_count[v_count > 1].index.to_list()
    # Drop every row whose value is one of the repeated ones
    df_du = df.drop(df.loc[df[column_name].isin(tes)].index)
    return (df_du, tes)

In [21]:
# Unpack the deduplicated dataframe and the list of repeated names
df_du, tes = removing_duplicates('name', df)

### *df_du* is the new dataframe from which the duplicated school names have been removed. To restore those schools, we will take the mean of each duplicated school's rows and add it to the *df_du* dataframe.

In [17]:
df_du

Unnamed: 0,name,school_rating,size,reduced_lunch,state_percentile_16,state_percentile_15,stu_teach_ratio,school_type,avg_score_15,avg_score_16,full_time_teachers,percent_black,percent_white,percent_asian,percent_hispanic
0,Allendale Elementary School,5.0,851.0,10.0,90.2,95.8,15.7,Public,89.4,85.2,54.0,2.9,85.5,1.6,5.6
1,Anderson Elementary,2.0,412.0,71.0,32.8,37.3,12.8,Public,43.0,38.3,32.0,3.9,86.7,1.0,4.9
2,Avoca Elementary,4.0,482.0,43.0,78.4,83.6,16.6,Public,75.7,73.0,29.0,1.0,91.5,1.2,4.4
3,Bailey Middle,0.0,394.0,91.0,1.6,1.0,13.1,Public Magnet,2.1,4.4,30.0,80.7,11.7,2.3,4.3
4,Barfield Elementary,4.0,948.0,26.0,85.3,89.2,14.8,Public,81.3,79.6,64.0,11.8,71.2,7.1,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Winfree Bryant Middle School,3.0,611.0,57.0,59.1,65.2,16.9,Public,61.4,57.7,36.0,15.2,66.3,1.5,15.7
343,Winstead Elementary School,5.0,515.0,8.0,93.9,97.0,14.3,Public,92.0,89.3,36.0,3.3,87.4,3.1,4.1
344,Woodland Elementary,4.0,424.0,55.0,84.8,76.7,14.1,Public,69.4,79.4,30.0,11.6,70.5,2.1,9.7
345,Woodland Middle School,5.0,866.0,2.0,93.3,97.1,19.2,Public,89.8,84.9,45.0,4.5,77.6,10.0,4.4
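
The averaging step described above (one mean row per repeated school, appended to the deduplicated frame) could be sketched as follows. This is an illustration on a toy dataframe with made-up values, not the notebook's own implementation; `toy`, `repeated`, `averaged` and `combined` are hypothetical names. It also assumes that rows sharing a name really are the same school, which should be verified first.

```python
import pandas as pd

# Toy frame with made-up values (NOT the real school data)
toy = pd.DataFrame({
    "name": ["School A", "School B", "School A", "School C"],
    "school_type": ["Public", "Public", "Public", "Public Magnet"],
    "size": [400.0, 500.0, 420.0, 610.0],
    "avg_score_16": [70.0, 55.0, 74.0, 40.0],
})

# Names that occur more than once
counts = toy["name"].value_counts()
repeated = counts[counts > 1].index.to_list()

# Split the frame into rows with unique names and rows with repeated names
is_repeated = toy["name"].isin(repeated)
unique_rows = toy[~is_repeated]

# One averaged row per repeated name; numeric_only=True skips text columns
# such as school_type, which therefore come back as NaN after the concat
averaged = toy[is_repeated].groupby("name", as_index=False).mean(numeric_only=True)

# Append the averaged rows to the deduplicated rows
combined = pd.concat([unique_rows, averaged], ignore_index=True)
print(combined)
```

Text columns such as `school_type` cannot be averaged, so in this sketch they are left missing for the averaged rows and would need to be filled in separately, for example from the first occurrence of each school.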
