# CrunchieMunchies

You work in marketing for a food company <b>myCorps</b>, which is developing a new kind of tasty, wholesome cereal called <b>CrunchieMunchies</b>. 

You want to demonstrate to consumers how healthy your cereal is in comparison to other leading brands, so you’ve dug up nutritional data on several different competitors.

Your task is to use <em>NumPy statistical calculations</em> to analyze this data and prove that your <b>CrunchieMunchies</b> is the healthiest choice for consumers.






# Task STEPS


1. First, import NumPy.

In [1]:
# your code goes here
import numpy as np
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

2. Look over the <b><em>cereal.csv</em></b> file. This file contains the reported calorie amounts for different cereal brands. Load the data from the file and save it as <b><em>calorie_stats</em></b>.



In [2]:
# your code goes here
calorie_stats = np.genfromtxt('cereal.csv', delimiter=',')
# calorie_stats.shape                ### 77 entries ~ 77 competitors ~ calorie amounts for 77 brands
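
A quick sanity check after loading can catch parsing problems early. The sketch below uses a small in-memory stand-in for the file (the five values and the name `sample_csv` are assumptions for illustration, not the real contents of cereal.csv) and checks the shape and the presence of `nan` entries, which `np.genfromtxt` produces for fields it cannot parse.

```python
import io
import numpy as np

# Hypothetical stand-in for cereal.csv: a few comma-separated calorie values
sample_csv = io.StringIO("70,120,70,50,110")

sample_stats = np.genfromtxt(sample_csv, delimiter=',')

print(sample_stats.shape)            # how many values were loaded
print(np.isnan(sample_stats).any())  # True would signal missing or unparsable fields
```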

3. There are <em>60 calories per serving of CrunchieMunchies</em>. How much <b>higher</b> is the <b>average calorie count</b> of your competition?

Save the answer to the variable <b>average_calories</b> and print the variable to the terminal to see the answer.


In [3]:
# your code goes here
average_calories = np.average(calorie_stats)

# manual calculation: sum of all values divided by the number of values
# total = calorie_stats.sum()
# total / 77

print(f'Average calories : {average_calories}')

Average calories : 106.88311688311688
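
Step 3 asks how much higher the competition's average is, so the gap itself is worth computing. This is a minimal sketch: the array `calories` is rebuilt from the sorted values printed in step 4 so the block runs without cereal.csv, and the names `calories`, `crunchie_calories` and `calorie_gap` are introduced here for illustration only.

```python
import numpy as np

# Calorie values and how often each appears, read off the sorted output in step 4
values = [50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
counts = [3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1]
calories = np.repeat(values, counts).astype(float)

crunchie_calories = 60  # calories per serving of CrunchieMunchies (from step 3)
calorie_gap = np.mean(calories) - crunchie_calories

print(f'Competitors average {calorie_gap:.2f} more calories per serving')
```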


4. Does the <b>average calorie count</b> adequately reflect the distribution of the dataset? Let’s sort the data and see.

<b><em>Sort</em></b> the data and save the result to the variable <b>calorie_stats_sorted</b>. Print the sorted data to the terminal.


In [4]:
# your code goes here
# calorie_stats_sorted = calorie_stats.sort()   # wrong: array.sort() sorts in place and returns None
# np.sort() returns a sorted copy and leaves calorie_stats unchanged
calorie_stats_sorted = np.sort(calorie_stats)
calorie_stats_sorted


array([ 50.,  50.,  50.,  70.,  70.,  80.,  90.,  90.,  90.,  90.,  90.,
        90.,  90., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 110., 110., 110.,
       110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110.,
       110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110.,
       110., 110., 110., 110., 120., 120., 120., 120., 120., 120., 120.,
       120., 120., 120., 130., 130., 140., 140., 140., 150., 150., 160.])
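
The difference between the two sorting calls is easy to see on a tiny array. A short sketch, using a made-up three-element array named `demo`:

```python
import numpy as np

demo = np.array([110., 50., 90.])

sorted_copy = np.sort(demo)   # returns a new sorted array; demo keeps its order
print(demo, sorted_copy)

result = demo.sort()          # sorts demo itself and returns None
print(result, demo)
```

Assigning the result of `demo.sort()` to a variable therefore stores `None`, which is why the notebook uses `np.sort` here.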

5. Do you see what I’m seeing? Looks like <b><em>the majority of the cereals are higher than the mean</em></b>. Let’s see if the <b>median</b> is a better representative of the dataset.

Calculate the median of the dataset and save your answer to <b><em>median_calories</em></b>. Print the median so you can see how it compares to the mean.

In [5]:
# your code goes here
median_calories = np.median(calorie_stats)
print(f'Median calories: {median_calories}')

Median calories: 110.0
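
The observation in step 5, that most cereals sit above the mean, can be checked directly rather than by eye. A sketch under the same assumption as before: `calories` is rebuilt from the sorted output in step 4, and `share_above_mean` is a name introduced here.

```python
import numpy as np

# Calorie values and how often each appears, read off the sorted output in step 4
values = [50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
counts = [3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1]
calories = np.repeat(values, counts).astype(float)

# The mean of a boolean mask is the fraction of True entries
share_above_mean = np.mean(calories > calories.mean())

print(f'Mean: {calories.mean():.2f}, median: {np.median(calories)}')
print(f'Share of cereals above the mean: {share_above_mean:.1%}')
```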


6. While the median demonstrates that <b><em><q>at least half of our values are over 100 calories</q></em></b>, it would be more impressive to show that a significant portion of the competition has a higher calorie count than CrunchieMunchies.

<b>Calculate different percentiles</b> and print them to the terminal until you find the lowest percentile that is greater than 60 calories. Save this value to the variable <b>nth_percentile</b>.


In [6]:
# your code goes here
nth_percentile = np.percentile(calorie_stats, 4) # 4 is the lowest whole percentile above 60: about 4% of brands are below CrunchieMunchies (60), about 96% are above
nth_percentile

70.0
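
Instead of trying percentiles by hand, a short loop can search for the lowest whole-number percentile above 60. This is a sketch, not the only way to do it: it assumes whole-number percentiles are enough, uses NumPy's default linear interpolation, and rebuilds the data from the sorted output in step 4 under the illustrative name `calories`.

```python
import numpy as np

# Calorie values and how often each appears, read off the sorted output in step 4
values = [50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
counts = [3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1]
calories = np.repeat(values, counts).astype(float)

# Walk up from the 1st percentile until the value exceeds 60 calories
for p in range(1, 101):
    value = np.percentile(calories, p)
    if value > 60:
        break

print(f'Lowest percentile above 60 calories: {p} ({value})')
```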

7. While the percentile shows us that <b><em><q>the majority of the competition has a much higher calorie count</q></em></b>, it’s an awkward concept to use in marketing materials.

Instead, let’s calculate the percentage of cereals that <b><em><q>have more than 60 calories per serving</q></em></b>. Save your answer to the variable <b><em>more_calories</em></b> and print it to the terminal.

In [7]:
# your code goes here

### personal notes
## First attempt: more_calories = calorie_stats[filt].sum() / calorie_stats.sum()
## This was wrong: it divides the SUM of the calorie values above 60 by the SUM
## of all 77 calorie values, which gives about 0.98.
## What we need is the COUNT of brands above 60 divided by the total COUNT of brands.

filt = calorie_stats > 60   # boolean mask: True where a brand has more than 60 calories
more_calories = len(calorie_stats[filt]) / len(calorie_stats)  # brands above 60 / all brands
print(f'Percentage : {more_calories * 100}')

Percentage : 96.1038961038961
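
The same share can be computed by counting the `True` entries of the mask directly, which avoids building the filtered array. A sketch with the data rebuilt from the sorted output in step 4; `calories`, `above` and `share` are names introduced here for illustration.

```python
import numpy as np

# Calorie values and how often each appears, read off the sorted output in step 4
values = [50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
counts = [3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1]
calories = np.repeat(values, counts).astype(float)

above = calories > 60                      # True for each brand above 60 calories
n_above = np.count_nonzero(above)          # number of such brands
share = n_above / calories.size            # fraction of all brands

print(f'{n_above} of {calories.size} brands ({share:.1%}) have more than 60 calories')
```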


8. Wow! That’s a really high percentage. That’s going to be very useful when we promote CrunchieMunchies. But one question is, <b><em>how much variation exists in the dataset?</em></b> Can we make the generalization that most cereals have around 100 calories or is the spread even greater?

Calculate the amount of variation by finding the <b><em>standard deviation</em></b>. Save your answer to <b><em>calorie_std</em></b> and print to the terminal. How can we incorporate this value into our analysis?

In [8]:
# your code goes here
calorie_std = np.std(calorie_stats)
print(f'Standard Deviation : {calorie_std}')

Standard Deviation : 19.35718533390827
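
One way to fold the standard deviation into the analysis is to express CrunchieMunchies' 60 calories as a distance from the competitors' mean in units of standard deviation, and to see how many cereals fall within one standard deviation of the mean. This is only a sketch of that idea: the data is rebuilt from the sorted output in step 4, and `z_crunchie` and `within_one_std` are names introduced here. Note that `np.std` defaults to the population formula (`ddof=0`).

```python
import numpy as np

# Calorie values and how often each appears, read off the sorted output in step 4
values = [50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
counts = [3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1]
calories = np.repeat(values, counts).astype(float)

mean = calories.mean()
std = calories.std()   # population standard deviation, same default as np.std

# How many standard deviations 60 calories lies from the competitors' mean
z_crunchie = (60 - mean) / std

# Fraction of cereals whose calories fall within one standard deviation of the mean
within_one_std = np.mean(np.abs(calories - mean) <= std)

print(f'60 calories is {z_crunchie:.2f} standard deviations from the mean')
print(f'{within_one_std:.1%} of cereals are within one standard deviation of the mean')
```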


9. Write a short paragraph that sums up your findings and how you think this data could be used to <b>myCorp’s</b> advantage when marketing CrunchieMunchies.


#### The **calorie count for CrunchieMunchies is 60**, while the **average calorie count of competitors is about 107**, which is a huge advantage for marketing. The median calorie count for competitors is 110, and only 3 of the 77 brands (about 4%) have fewer calories than CrunchieMunchies. **The 4th percentile of the competition (70 calories) is already above CrunchieMunchies' 60 calories.** This means that out of 77 brands, 74 (about 96%) have more calories than CrunchieMunchies. With a standard deviation of about 19 calories, 60 calories sits more than two standard deviations below the competitors' mean.