# Finding the Two Best Markets to Advertise E-learning Products

In this project, we'll aim to find the two best markets in which to advertise the products of an e-learning company that offers courses on programming. Most of the courses are on web and mobile development, but the company also covers many other domains, such as data science, game development, and other important topics.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') # For ignoring the unwanted warnings shown on the notebook 
pd.options.display.max_columns=200 # To avoid column truncation
%matplotlib inline

# Understanding the Data

To reach our goal, we could organize surveys for a couple of different markets to find out which would be the best choices for advertising. This is very costly, however, so it makes sense to explore cheaper options first.

We can try to search for existing data that might be relevant for our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.

In [2]:
fcc=pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv')

FileNotFoundError: [Errno 2] File 2017-fCC-New-Coders-Survey-Data.csv does not exist: '2017-fCC-New-Coders-Survey-Data.csv'
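
The `FileNotFoundError` above means the CSV was not in the notebook's working directory when this cell ran. A minimal sketch of a guarded load follows, assuming the survey file has been downloaded next to the notebook. The helper name `load_survey` and the variable `fcc_checked` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_survey(path):
    """Return the survey as a DataFrame, or None if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        print(f"Could not find {csv_path.resolve()}; download the survey CSV first.")
        return None
    # low_memory=False reads the file in one pass so that mixed-type columns
    # get a single inferred dtype instead of chunk-by-chunk guesses.
    return pd.read_csv(csv_path, low_memory=False)


# Hypothetical usage: None here means the file still has to be downloaded.
fcc_checked = load_survey('2017-fCC-New-Coders-Survey-Data.csv')
```

Checking for the file first gives a readable message instead of a traceback, and makes it obvious why the later cells have nothing to work with.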

In [None]:
fcc.shape

In [None]:
fcc.head()

# Checking for Sample Representativity

As we mentioned in the introduction, most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. For the purpose of our analysis, we want to answer questions about a population of new coders that are interested in the subjects we teach. We'd like to know:

    -->Where these new coders are located.
    -->Which locations have the greatest densities of new coders.
    -->How much money they're willing to spend on learning.
    
So we first need to clarify whether the data set has the right categories of people for our purpose. The JobRoleInterest column describes, for every participant, the role(s) they'd be interested in working in. If a participant is interested in working in a certain domain, we can assume that they're also interested in learning about that domain. So let's take a look at the frequency distribution table of this column and determine whether the data we have is relevant.

In [None]:
fcc['JobRoleInterest'].value_counts(normalize=True)*100

We can see that most of the students are interested in web development courses (full-stack, front-end, and back-end web development). A few are interested in mobile development and other courses.

Many students are interested in more than one course, so it would be useful to know how many.
Next, we'll drop the null values and find the percentage of students for each number of courses they are interested in.

In [None]:
interested=fcc['JobRoleInterest'].dropna() # dropping the null values
interested_splited=interested.str.split(',') # splitting the courses based on ','
def length(element):
    return len(element)
no_of_courses=interested_splited.apply(length)
print('Courses'+'  '+'Percentage(%)')
no_of_courses.value_counts(normalize=True).sort_index()*100
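
To make the split-and-count step easier to follow, here is a small sketch on made-up answers (not survey data). It uses `.str.len()` in place of the `length` helper above, which should give the same counts. The names `toy_interest`, `roles_per_person`, and `share_by_count` are introduced only for this illustration.

```python
import pandas as pd

# Made-up answers in the same comma-separated style as JobRoleInterest.
toy_interest = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer, Mobile Developer',
    'Data Scientist, Game Developer, Back-End Web Developer',
    None,  # a respondent who skipped the question
])

# Drop missing answers, split on commas, and count the pieces per respondent.
roles_per_person = toy_interest.dropna().str.split(',').str.len()

# Share of respondents (in %) by number of roles selected.
share_by_count = roles_per_person.value_counts(normalize=True).sort_index() * 100
print(share_by_count)
```

Each of the three remaining toy respondents picked a different number of roles, so each count gets an equal share.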

From the frequency distribution table, we can see that nearly 32% of students have a clear idea of what they want to pursue, while the remaining 68% have mixed interests. Since the e-learning company offers a variety of courses, the fact that students have mixed interests is good for us.

The focus of our courses is on web and mobile development, so let's find out how many respondents chose at least one of these two options.

In [None]:
web_or_mobile_count=interested.str.contains('Web Developer|Mobile Developer').value_counts(normalize=True)*100
# frequency table for students interested in Web Developer and Mobile Developer
print(web_or_mobile_count)
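
The pattern passed to `str.contains` is a regular expression, and `|` means "or". A minimal sketch on made-up answers (not survey data) shows which rows it flags. The names `toy_interest` and `is_web_or_mobile` are introduced only for this illustration.

```python
import pandas as pd

# Made-up answers, not survey data.
toy_interest = pd.Series([
    'Full-Stack Web Developer, Data Scientist',
    'Mobile Developer',
    'Game Developer',
    'Data Engineer, Information Security',
])

# True if either phrase appears anywhere in the answer.
is_web_or_mobile = toy_interest.str.contains('Web Developer|Mobile Developer')
print(is_web_or_mobile.tolist())

# Percentage of True and False answers.
print(is_web_or_mobile.value_counts(normalize=True) * 100)
```

Because the match is on a substring, 'Full-Stack Web Developer' counts as web development even though the respondent listed a second role.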

In [None]:
plt.style.use('fivethirtyeight')
web_or_mobile_count.plot.bar()
plt.title('Most Participants are Interested in \nWeb or Mobile Development')
plt.ylabel('Percentage')
plt.xticks([0,1],['Web or mobile\ndevelopment', 'Other subject'],rotation = 0) # the initial xtick labels were True and False
plt.ylim([0,100])
plt.show()

About 86% of students are interested in web or mobile development. From this plot, we can say that this sample is representative of our population of interest: we want to advertise our courses to people interested in all sorts of programming subjects, but mostly web and mobile development.

Now we need to figure out what are the best markets to invest money in for advertising our courses. We'd like to know:

    -->Where these new students are located.
    -->Which locations have the greatest number of new students.
    -->How much money new students are willing to spend on learning.

# New Students: Locations and Densities

The data set provides information about the location of each participant at a country level. The CountryCitizen variable describes the country of origin for each participant, and the CountryLive variable describes what country each participant lives in (which may be different from the country of origin).

For our analysis, we'll work with the CountryLive variable because we're interested in where people actually live at the moment we run the ads. In other words, we're interested in where people are located, not where they were born.

Because the data set provides information at a country level, we can think of each country as an individual market. This means we can frame our goal as finding the two best countries to advertise in.

In [None]:
# dropping all rows where participants didn't answer what role they are interested in.
fcc_notnull=fcc[fcc['JobRoleInterest'].notnull()].copy()
absolute_freq=fcc_notnull['CountryLive'].value_counts()
relative_freq=fcc_notnull['CountryLive'].value_counts(normalize=True)*100
pd.DataFrame(data={'absolute_frequency':absolute_freq,'relative_frequency':relative_freq})

Nearly 46% of students live in the United States of America. India comes second with nearly 8%, but that is not far ahead of the United Kingdom and Canada. These figures only reflect the number of students interested in learning.

This is useful information, but we need to go more in depth than this and figure out how much money people are actually willing to spend on learning. Advertising within markets where most people are only willing to learn for free is extremely unlikely to be profitable for us.

# Spending Money for Learning

The MoneyForLearning column describes in American dollars the amount of money spent by participants from the moment they started coding until the moment they completed the survey.

The E-learning company sells subscriptions at a price of $59 per month, and for this reason we're interested in finding out how much money each student spends per month.

It also seems like a good idea to narrow down our analysis to only four countries: the US, India, the United Kingdom, and Canada. Two reasons for this decision are:

    -->These are the countries having the highest absolute frequencies in our sample, which means we have a decent amount of data for each.
    -->Our courses are written in English, and English is an official language in all four of these countries. The more people that know English, the better our chances to target the right people with our ads.

Some students answered that they had been learning to code for 0 months (it might be that they had just started when they completed the survey). To avoid dividing by 0, we'll replace all the values of 0 with 1.

In [None]:
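# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions may not propagate to the DataFrame.
fcc_notnull['MonthsProgramming']=fcc_notnull['MonthsProgramming'].replace(0,1)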
fcc_notnull['MoneyPerMonth']=fcc_notnull['MoneyForLearning']/fcc_notnull['MonthsProgramming']
fcc_notnull=fcc_notnull[fcc_notnull['MoneyPerMonth'].notnull()]
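
The arithmetic is easier to check on a handful of rows. The sketch below uses made-up numbers (not survey data) and a throwaway DataFrame named `toy`, but follows the same three steps: replace 0 months with 1, divide, and drop rows where the result is missing.

```python
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame({
    'MoneyForLearning': [0, 120, 600, None],
    'MonthsProgramming': [0, 0, 6, 12],
})

# Treat "0 months" as 1 month so the division never hits zero.
months = toy['MonthsProgramming'].replace(0, 1)
toy['MoneyPerMonth'] = toy['MoneyForLearning'] / months

# A missing amount produces NaN, and that row is dropped.
toy = toy[toy['MoneyPerMonth'].notnull()]
print(toy)
```

Without the replacement, the second row would divide 120 by 0 and come out as infinity, which would then distort any mean computed later.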

Since we're going to work with both the 'MoneyPerMonth' and 'CountryLive' columns, we also remove the rows where 'CountryLive' is null.

In [None]:
fcc_good=fcc_notnull[fcc_notnull['CountryLive'].notnull()].copy()
# Frequency table to check if we still have enough data
fcc_good['CountryLive'].value_counts().head()

In [None]:
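country_grouped=fcc_good.groupby('CountryLive')
# Select the numeric column before averaging; calling mean() on every column
# raises an error in recent pandas versions when text columns are present.
mean=country_grouped['MoneyPerMonth'].mean()
print(mean.loc[['United States of America','India', 'United Kingdom','Canada']])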

The results for the United Kingdom and Canada are surprisingly low relative to the values we see for India. If we considered a few socioeconomic metrics (like GDP per capita), we'd intuitively expect people in the UK and Canada to spend more on learning than people in India.

It might be that we don't have enough representative data for the United Kingdom, Canada, and India, or that we have some outliers (maybe coming from wrong survey answers) making the mean too high for India or too low for the UK and Canada. Or it might be that the results are correct.
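
To see how much a single extreme answer can move a mean, here is a small sketch with made-up monthly spending values (not survey data). The names `spending` and `trimmed` are introduced only for this illustration.

```python
import pandas as pd

# Made-up monthly spending; the last value plays the role of a bad survey answer.
spending = pd.Series([0, 20, 50, 60, 100, 50000])

print('mean:  ', spending.mean())
print('median:', spending.median())

# Dropping the single extreme value brings the mean back in line.
trimmed = spending[spending < 20000]
print('mean without the extreme value:', trimmed.mean())
```

The median barely notices the extreme value, which is one reason to inspect the distributions before trusting the per-country means.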

# Dealing with Extreme Outliers

In [None]:
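main_countries=['United States of America','United Kingdom','India','Canada']
# isin matches the exact country names rather than substrings
four_countries=fcc_good[fcc_good['CountryLive'].isin(main_countries)]
# order= fixes the box positions so they line up with the tick labels below
sns.boxplot(x='CountryLive',y='MoneyPerMonth',data=four_countries,order=main_countries)
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada'])
plt.title('Money Spent Per Month Per Country\n(Distributions)')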

It's hard to tell from the plot above whether there's anything wrong with the data for the United Kingdom, India, or Canada, but we can see immediately that something is off for the US: two respondents report spending \$50,000 or more per month on learning. This is not impossible, but it seems extremely unlikely, so we'll keep only values below \$20,000 per month.

In [None]:
four_countries=four_countries[four_countries['MoneyPerMonth']<20000]
# Average only the numeric column of interest
mean=four_countries.groupby('CountryLive')['MoneyPerMonth'].mean()
print(mean)

In [None]:
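# order= fixes the box positions so they line up with the tick labels below
sns.boxplot(x='CountryLive',y='MoneyPerMonth',data=four_countries,order=['United States of America','United Kingdom','India','Canada'])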
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada'])
plt.title('Money Spent Per Month Per Country\n(Distributions)')

There are a few outliers for India. It's unclear, though, whether these are students who attended several bootcamps or whether the values are simply wrong.

In [None]:
india_outliers=four_countries[(four_countries['MoneyPerMonth']>2000)&(four_countries['CountryLive']=="India")]
india_outliers

From the table above, we can conclude that the students from India who reported spending more than \$2,000 per month did not attend any bootcamp, so nothing in the data explains these amounts. We'll treat these rows as outliers and remove them.

In [None]:
four_countries=four_countries.drop(india_outliers.index,axis=0)

In [None]:
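# order= fixes the box positions so they line up with the tick labels below
sns.boxplot(x='CountryLive',y='MoneyPerMonth',data=four_countries,order=['United States of America','United Kingdom','India','Canada'])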
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada'])
plt.title('Money Spent Per Month Per Country\n(Distributions)')

As with India, we'll now examine Canada, since it also contains a few outliers.

In [None]:
canada_outliers=four_countries[(four_countries['CountryLive']=='Canada')&(four_countries['MoneyPerMonth']>4000)]
canada_outliers

The student seems to have paid a large sum of money at the beginning to enroll in a bootcamp, and then probably didn't spend anything for the next couple of months after the survey. We'll take the same approach here as for the US and remove this outlier.

In [None]:
four_countries=four_countries.drop(canada_outliers.index,axis=0)

In [None]:
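# order= fixes the box positions so they line up with the tick labels below
sns.boxplot(x='CountryLive',y='MoneyPerMonth',data=four_countries,order=['United States of America','United Kingdom','India','Canada'])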
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada'])
plt.title('Money Spent Per Month Per Country\n(Distributions)')

We can still see a few outliers for the US. We'll examine them to decide whether these rows should be kept.

In [None]:
usa_outliers=four_countries[(four_countries['MoneyPerMonth']>=6000)&(four_countries['CountryLive']=='United States of America')]
usa_outliers

Out of these 11 extreme outliers, six people attended bootcamps, which justifies the large sums of money spent on learning. For the other five, it's hard to figure out from the data where they could have spent that much money on learning. Consequently, we'll remove the rows where participants reported spending \$6,000 or more each month but never attended a bootcamp.

In the next code block, we'll remove respondents that:

    -->Didn't attend bootcamps.
    -->Had been programming for three months or less at the time they completed the survey.

In [None]:
usa_outliers_updated=four_countries[(four_countries['MoneyPerMonth']>=6000)&(four_countries['CountryLive']=='United States of America')&(four_countries['AttendedBootcamp']==0)]
# Remove the high-spending US respondents who didn't attend a bootcamp
four_countries=four_countries.drop(usa_outliers_updated.index,axis=0)
less_than_3months=four_countries[(four_countries['MoneyPerMonth']>=6000)&(four_countries['CountryLive']=='United States of America')&(four_countries['MonthsProgramming']<=3)]
# Remove the high-spending US respondents who had been programming for three months or less
four_countries=four_countries.drop(less_than_3months.index,axis=0)
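
The two removal steps above can also be read as one combined condition. The sketch below applies that condition to made-up rows (not survey data). The DataFrame `toy` and the mask names are introduced only for this illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'CountryLive': ['United States of America'] * 4 + ['India'],
    'MoneyPerMonth': [8000, 9000, 7000, 100, 6500],
    'AttendedBootcamp': [0, 1, 1, 0, 0],
    'MonthsProgramming': [12, 2, 10, 5, 8],
})

high_us = (toy['MoneyPerMonth'] >= 6000) & (toy['CountryLive'] == 'United States of America')
no_bootcamp = toy['AttendedBootcamp'] == 0
very_new = toy['MonthsProgramming'] <= 3

# Drop high-spending US rows that either never attended a bootcamp
# or had three months or less of programming experience.
to_drop = high_us & (no_bootcamp | very_new)
cleaned = toy[~to_drop]
print(cleaned)
```

Only the first two toy rows are dropped. The third attended a bootcamp and has enough months to spread the cost, the fourth spends little, and the last is not in the US.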

In [None]:
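# order= fixes the box positions so they line up with the tick labels below
sns.boxplot(x='CountryLive',y='MoneyPerMonth',data=four_countries,order=['United States of America','United Kingdom','India','Canada'])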
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada'])
plt.title('Money Spent Per Month Per Country\n(Distributions)')

In [None]:
# Average only the numeric column of interest, then rank the countries
four_countries.groupby('CountryLive')['MoneyPerMonth'].mean().sort_values(ascending=False)

In [None]:
# Frequency table to check if we still have enough data after removing outliers
four_countries['CountryLive'].value_counts().head()

The ordering of the means now looks more plausible in light of each country's GDP per capita.

# Choosing the Two Best Markets

In [None]:
# Average only the numeric column of interest, then rank the countries
four_countries.groupby('CountryLive')['MoneyPerMonth'].mean().sort_values(ascending=False)

In [None]:
four_countries['CountryLive'].value_counts(normalize=True)*100

Considering the results we've found so far, one country we should definitely advertise in is the US. There are a lot of new students living there and they are willing to pay a good amount of money each month.

We need to choose one more market, though.
Based on the number of students, India has more potential customers. But based on the amount spent per month, students from Canada are ready to pay more than students from India.
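
One possible way to weigh head count against spending is to put both side by side, together with the number of respondents whose monthly spending already reaches the \$59 subscription price. The sketch below does this on made-up rows, so its numbers are placeholders and not survey results. The column `AtOrAbovePrice` and the table `summary` are names introduced only for this illustration.

```python
import pandas as pd

# Made-up respondents; these numbers are placeholders, not survey results.
toy = pd.DataFrame({
    'CountryLive': ['India', 'India', 'India', 'India', 'Canada', 'Canada'],
    'MoneyPerMonth': [0, 10, 60, 70, 59, 120],
})

SUBSCRIPTION_PRICE = 59  # monthly price stated earlier in this notebook

# Flag respondents whose current spending already reaches the subscription price.
toy['AtOrAbovePrice'] = toy['MoneyPerMonth'] >= SUBSCRIPTION_PRICE

summary = toy.groupby('CountryLive').agg(
    respondents=('MoneyPerMonth', 'count'),
    mean_spend=('MoneyPerMonth', 'mean'),
    at_or_above_price=('AtOrAbovePrice', 'sum'),
)
print(summary)
```

Applied to `four_countries`, a table like this would show whether India's larger audience or Canada's higher spending matters more at the \$59 price point. Current spending is only a rough proxy for willingness to pay.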

# Conclusion

We'll list a few options and forward them to the marketing team, who can use their domain knowledge to pick the best one, adjust our proposals, or make their own decision based on the analysis so far.

Investment decisions:

    -->Decision 1: 70% in USA and 30% in Canada.

    -->Decision 2: 70% in USA and 30% in India.

    -->Decision 3: 70% in USA, 15% in Canada and 15% in India.