# Finding the Best Markets to Advertise In


![best markets](finding.jpeg)


We're working for an an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. We want to promote our product and we'd like to invest some money in advertisement. 

**Our goal in this project is to find out the two best markets to advertise our product in**.

To reach our goal, we could organize surveys for a couple of different markets to find out which would the best choices for advertising. This is very costly, however, and it's a good call to explore cheaper options first.

We can try to search existing data that might be relevant for our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.

The survey data is publicly available in [this GitHub repository](https://github.com/freeCodeCamp/2018-new-coder-survey).



In [1]:
! pwd

/home/ion/Documentos/albertjimrod/data-projects/06_Probability_and_Statistics/intermediate statistics in python/Project_Finding_the_best_markets_to_advertise_in


In [2]:
# load libraries
import pandas as pd
import numpy as np
import chardet
import re

In [3]:
#%%time
# checking dataset with the Universal Character Encoding Detector
#with open("2018-new-coder-survey.csv", "rb") as file:
#    print(chardet.detect(file.read()))

In [4]:
coder_survey = pd.read_csv("csv/2017-fCC-New-Coders-Survey-Data.csv",low_memory=False, encoding = 'utf-8' )

## Demostracion que el recuento no es lo que parece

In [5]:
maskofficial = coder_survey['JobRoleInterest'].str.contains('[W-w]eb|[M-m]obile') # returns an array of booleans
freq_table2 = maskofficial.value_counts(normalize = True) * 100
print(freq_table2)

True     86.312929
False    13.687071
Name: JobRoleInterest, dtype: float64


In [6]:
#coder_survey.loc[maskofficial,"JobRoleInterest"].value_counts().sum()

In [7]:
coder_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Columns: 136 entries, Age to YouTubeTheNewBoston
dtypes: float64(105), object(31)
memory usage: 18.9+ MB


In [8]:
datatype = coder_survey.dtypes
datatype_counter = datatype.value_counts()
datatype_counter

float64    105
object      31
dtype: int64

 ### Optimizing memory
 
 Let's see if we can reduce the memory space of the dataset we are working with.

In [9]:
for column in coder_survey.columns:
    if coder_survey[column].dtype == 'float64': 
        coder_survey[column] = coder_survey[column].astype('float32')
    if coder_survey[column].dtype == 'int64':
        coder_survey[column] = coder_survey[column].astype('int32')

In [10]:
coder_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Columns: 136 entries, Age to YouTubeTheNewBoston
dtypes: float32(105), object(31)
memory usage: 11.6+ MB


 ❗ The reduction of memory has been almost 39%

In [11]:
pd.set_option("display.max_columns", None)

In [12]:
coder_survey.head(5)

Unnamed: 0,Age,AttendedBootcamp,BootcampFinish,BootcampLoanYesNo,BootcampName,BootcampRecommend,ChildrenNumber,CityPopulation,CodeEventConferences,CodeEventDjangoGirls,CodeEventFCC,CodeEventGameJam,CodeEventGirlDev,CodeEventHackathons,CodeEventMeetup,CodeEventNodeSchool,CodeEventNone,CodeEventOther,CodeEventRailsBridge,CodeEventRailsGirls,CodeEventStartUpWknd,CodeEventWkdBootcamps,CodeEventWomenCode,CodeEventWorkshops,CommuteTime,CountryCitizen,CountryLive,EmploymentField,EmploymentFieldOther,EmploymentStatus,EmploymentStatusOther,ExpectedEarning,FinanciallySupporting,FirstDevJob,Gender,GenderOther,HasChildren,HasDebt,HasFinancialDependents,HasHighSpdInternet,HasHomeMortgage,HasServedInMilitary,HasStudentDebt,HomeMortgageOwe,HoursLearning,ID.x,ID.y,Income,IsEthnicMinority,IsReceiveDisabilitiesBenefits,IsSoftwareDev,IsUnderEmployed,JobApplyWhen,JobInterestBackEnd,JobInterestDataEngr,JobInterestDataSci,JobInterestDevOps,JobInterestFrontEnd,JobInterestFullStack,JobInterestGameDev,JobInterestInfoSec,JobInterestMobile,JobInterestOther,JobInterestProjMngr,JobInterestQAEngr,JobInterestUX,JobPref,JobRelocateYesNo,JobRoleInterest,JobWherePref,LanguageAtHome,MaritalStatus,MoneyForLearning,MonthsProgramming,NetworkID,Part1EndTime,Part1StartTime,Part2EndTime,Part2StartTime,PodcastChangeLog,PodcastCodeNewbie,PodcastCodePen,PodcastDevTea,PodcastDotNET,PodcastGiantRobots,PodcastJSAir,PodcastJSJabber,PodcastNone,PodcastOther,PodcastProgThrowdown,PodcastRubyRogues,PodcastSEDaily,PodcastSERadio,PodcastShopTalk,PodcastTalkPython,PodcastTheWebAhead,ResourceCodecademy,ResourceCodeWars,ResourceCoursera,ResourceCSS,ResourceEdX,ResourceEgghead,ResourceFCC,ResourceHackerRank,ResourceKA,ResourceLynda,ResourceMDN,ResourceOdinProj,ResourceOther,ResourcePluralSight,ResourceSkillcrush,ResourceSO,ResourceTreehouse,ResourceUdacity,ResourceUdemy,ResourceW3S,SchoolDegree,SchoolMajor,StudentDebtOwe,YouTubeCodeCourse,YouTubeCodingTrain,YouTubeCodingTut360,YouTubeComputerphile,YouTubeDerekBanas,YouTubeDevTips,YouTubeEngineeredTruth,YouTubeFCC,YouTubeFunFunFunction,YouTubeGoogleDev,YouTubeLearnCode,YouTubeLevelUpTuts,YouTubeMIT,YouTubeMozillaHacks,YouTubeOther,YouTubeSimplilearn,YouTubeTheNewBoston
0,27.0,0.0,,,,,,more than 1 million,,,,,,,,,,,,,,,,,15 to 29 minutes,Canada,Canada,software development and IT,,Employed for wages,,,,,female,,,1.0,0.0,1.0,0.0,0.0,0.0,,15.0,02d9465b21e8bd09374b0066fb2d5614,eb78c1c3ac6cd9052aec557065070fbf,,,0.0,0.0,0.0,,,,,,,,,,,,,,,start your own business,,,,English,married or domestic partnership,150.0,6.0,6f1fbc6b2b,2017-03-09 00:36:22,2017-03-09 00:32:59,2017-03-09 00:59:46,2017-03-09 00:36:26,,,,1.0,,,,,,,,,,,,,,1.0,,,,,,1.0,,,,1.0,,,,,,,,1.0,1.0,"some college credit, no degree",,,,,,,,,,,,,,,,,,,
1,34.0,0.0,,,,,,"less than 100,000",,,,,,,,,,,,,,,,,,United States of America,United States of America,,,Not working but looking for work,,35000.0,,,male,,,1.0,0.0,1.0,0.0,0.0,1.0,,10.0,5bfef9ecb211ec4f518cfc1d2a6f3e0c,21db37adb60cdcafadfa7dca1b13b6b1,,0.0,0.0,0.0,,Within 7 to 12 months,,,,,,1.0,,,,,,,,work for a nonprofit,1.0,Full-Stack Web Developer,in an office with other developers,English,"single, never married",80.0,6.0,f8f8be6910,2017-03-09 00:37:07,2017-03-09 00:33:26,2017-03-09 00:38:59,2017-03-09 00:37:10,,1.0,,,,,,,,,,,,,,,,1.0,,,1.0,,,1.0,,,,,,,,,1.0,,,1.0,1.0,"some college credit, no degree",,,,,,,,,,1.0,,,,,,,,,
2,21.0,0.0,,,,,,more than 1 million,,,,,,1.0,,1.0,,,,,,,,,15 to 29 minutes,United States of America,United States of America,software development and IT,,Employed for wages,,70000.0,,,male,,,0.0,0.0,1.0,,0.0,,,25.0,14f1863afa9c7de488050b82eb3edd96,21ba173828fbe9e27ccebaf4d5166a55,13000.0,1.0,0.0,0.0,0.0,Within 7 to 12 months,1.0,,,1.0,1.0,1.0,,,1.0,,,,,work for a medium-sized company,1.0,"Front-End Web Developer, Back-End Web Develo...",no preference,Spanish,"single, never married",1000.0,5.0,2ed189768e,2017-03-09 00:37:58,2017-03-09 00:33:53,2017-03-09 00:40:14,2017-03-09 00:38:02,1.0,,1.0,,,,,,,Codenewbie,,,,,1.0,,,1.0,,,1.0,,,1.0,,,,1.0,,,,,,,1.0,1.0,,high school diploma or equivalent (GED),,,,,1.0,,1.0,1.0,,,,,1.0,1.0,,,,,
3,26.0,0.0,,,,,,"between 100,000 and 1 million",,,,,,,,,,,,,,,,,I work from home,Brazil,Brazil,software development and IT,,Employed for wages,,40000.0,0.0,,male,,0.0,1.0,1.0,1.0,1.0,0.0,0.0,40000.0,14.0,91756eb4dc280062a541c25a3d44cfb0,3be37b558f02daae93a6da10f83f0c77,24000.0,0.0,0.0,0.0,1.0,Within the next 6 months,1.0,,,,1.0,1.0,,,,,,,,work for a medium-sized company,,"Front-End Web Developer, Full-Stack Web Deve...",from home,Portuguese,married or domestic partnership,0.0,5.0,dbdc0664d1,2017-03-09 00:40:13,2017-03-09 00:37:45,2017-03-09 00:42:26,2017-03-09 00:40:18,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,,,,1.0,,,,,1.0,,,,,"some college credit, no degree",,,,,,,,1.0,,1.0,1.0,,,1.0,,,,,
4,20.0,0.0,,,,,,"between 100,000 and 1 million",,,,,,,,,,,,,,,,,,Portugal,Portugal,,,Not working but looking for work,,140000.0,,,female,,,0.0,0.0,1.0,,0.0,,,10.0,aa3f061a1949a90b27bef7411ecd193f,d7c56bbf2c7b62096be9db010e86d96d,,0.0,0.0,0.0,,Within 7 to 12 months,1.0,,,,1.0,1.0,,1.0,1.0,,,,,work for a multinational corporation,1.0,"Full-Stack Web Developer, Information Security...",in an office with other developers,Portuguese,"single, never married",0.0,24.0,11b0f2d8a9,2017-03-09 00:42:45,2017-03-09 00:39:44,2017-03-09 00:45:42,2017-03-09 00:42:50,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,bachelor's degree,Information Technology,,,,,,,,,,,,,,,,,,


## Data Exploration

### Understanding the Data:

- Why I'm using a ready-made data set instead of organizing a survey?.

    - I wanted to get a large sample size quickly and efficiently, so I decided to use pre-existing data instead of conducting my own survey." althought it may not be the best option for every situation.


- What's this data set about.

    - An anonymous survey of thousands of people who started coding less than 5 years ago.
    
- Where can this data set be downloaded.
    - [Download 2017-fCC-New-Coders-Survey-Data.csv](https://github.com/freeCodeCamp/2017-new-coder-survey/blob/master/clean-data/2017-fCC-New-Coders-Survey-Data.csv)

In [13]:
coder_survey['Age'].unique()

array([27., 34., 21., 26., 20., 28., 29., 23., 24., 22., 18., 44., 32.,
       46., 31., 30., 19., 54., 37., 36., 16., 17., 25., nan, 45., 48.,
       33., 43., 35., 42., 53., 15., 41., 60., 39., 38., 56., 52., 13.,
       14., 69., 40., 50., 47., 57., 59., 12., 58., 70., 68., 51., 11.,
       49., 73., 55., 71., 67., 72.,  1., 10., 61., 62., 63.,  0., 76.,
        8., 90., 66.,  2.,  5., 65.,  3., 64., 75.], dtype=float32)

In [14]:
coder_survey['CityPopulation'].value_counts()

more than 1 million              6534
between 100,000 and 1 million    5276
less than 100,000                3544
Name: CityPopulation, dtype: int64

In [15]:
coder_survey['CountryCitizen'].value_counts().head(10)

United States of America    5480
India                       1594
United Kingdom               640
Canada                       564
Brazil                       399
Poland                       297
Russia                       294
Germany                      251
Ukraine                      246
France                       245
Name: CountryCitizen, dtype: int64

In [16]:
coder_survey['MonthsProgramming'].unique()

array([  6.,   5.,  24.,  12.,   4.,  29.,  18.,   1.,   3.,  nan,   9.,
        40.,  14.,  28.,  20.,   2.,   0.,  25.,   8.,  32.,  15.,  16.,
        48.,  10.,  26.,  60., 200.,  36.,  30.,  52.,  58.,  19.,  45.,
        50.,  54., 100.,  80., 120.,  22.,  72.,   7.,  43.,  13.,  17.,
        84.,  21.,  66.,  11.,  96.,  49., 744.,  55.,  35., 250., 240.,
        90., 572.,  42.,  51., 192.,  39.,  70., 202.,  86.,  38.,  27.,
        33.,  34.,  59., 180.,  23., 108.,  46.,  68.,  65.,  44.,  41.,
       105., 168., 110., 190., 150.,  31., 204., 480., 300.,  56., 600.,
       140., 160., 228.,  57.,  75.,  87., 400.,  85., 156., 130., 135.,
       360., 132.,  82., 743.,  95., 113., 124.,  76.,  64., 370., 144.,
       264.,  62.,  73.,  83., 123.,  63., 336., 114.,  78., 111.,  47.,
       432., 216., 244.,  37.,  92., 720., 230.,  94., 103.,  69.,  53.,
       500., 115., 136., 312., 276., 171., 198.,  67.,  97.,  99.,  61.,
       450., 220.,  98., 205., 127., 420., 107., 34

In [17]:
coder_survey['MoneyForLearning'].value_counts().head(10)

0.0       7985
100.0     1166
200.0      789
500.0      631
50.0       577
1000.0     521
300.0      465
20.0       314
2000.0     255
150.0      241
Name: MoneyForLearning, dtype: int64

In [18]:
coder_survey['JobRoleInterest'].head(10)

0                                                  NaN
1                             Full-Stack Web Developer
2      Front-End Web Developer, Back-End Web Develo...
3      Front-End Web Developer, Full-Stack Web Deve...
4    Full-Stack Web Developer, Information Security...
5                                                  NaN
6                             Full-Stack Web Developer
7                                                  NaN
8                                                  NaN
9    Full-Stack Web Developer,   Quality Assurance ...
Name: JobRoleInterest, dtype: object

## Representability

We will focus only answering questions about the population of new programmers who are interested in our offer of courses on mobile and web development.

However, first of all it is interesting to create a frequency distribution table, taking percentages. From this table the following questions must be answered: 

 - Are people interested only in one topic or on the contrary are interested in more than one topic. 

 - If most people are interested in more than one topic the sample is still representative. 
 
 - How many people are interested in web or mobile applications.



Having the configuration of lists by rows that we have in the dataframe, we can see that with the following example that every element that is contained in the list will not be taken into account by the `str.contains` method.

We can see that the result returned by contains is 100% false

........

# Here are the steps shown in the official solution

It can be seen that we have null values in the series with which we are going to work

In [19]:
#coder_survey['JobRoleInterest'].isnull().value_counts()

In [20]:
#antes_dropna = coder_survey['JobRoleInterest'].isnull().sum()
#antes_dropna

In [21]:
###  Official solution

## All rows in our series that contain null values are deleted

#coder_survey['JobRoleInterest'].dropna(axis = 0, inplace=True)

In [22]:
#despues = coder_survey['JobRoleInterest'].isnull().sum()
#despues

In [23]:
#coder_survey['JobRoleInterest'].dtype

When `'dtype('O')'` is displayed in pandas, it refers to the object data type, which can contain a variety of data types, including text and other non-numeric object types.

<br>

With this in mind we will homogenize the content of the series to be able to work with it.

In [24]:
##interests_no_nulls = coder_survey['JobRoleInterest'].dropna()
#interests_no_nulls.isnull().sum()

In [25]:
#coder_survey['JobRoleInterest'].isnull().sum()

In [26]:
#coder_survey['JobRoleInterest'] = coder_survey['JobRoleInterest'].astype(str)
#coder_survey['JobRoleInterest'].replace(['nan'], np.nan, inplace=True)
#coder_survey.dropna(subset=['JobRoleInterest'], inplace=True)

In [27]:
#coder_survey['JobRoleInterest'].isnull().sum()


As we have elements separated by commas, what we will do is make use of `str.split` 
that Splits the string in the Series / Index from the beginning, at the specified delimiter string.

In [28]:
#interests_no_nulls = coder_survey['JobRoleInterest'].dropna()
#splitted_interests = interests_no_nulls.str.split(',') 
#splitted_interests.head(5)

In [29]:
###  Official solution

## In the official solution they show us which are the times that 1 or more elements have been chosen per user.
## and work continues from now on based on these data.

#n_of_options = splitted_interests.apply(lambda x: len(x))
#n_of_options.value_counts(normalize = True).sort_index() * 100

In [30]:
#splitted_interests

In [31]:
###  Official solution

##Transform each element of a list-like to a row, replicating index values.

#splitted_interests.explode('JobRoleInterest').value_counts()

The above Value Counts output format is striking because there is scrolled text and that means that the space character has not been removed.

### Alert 1

## This is the process I have followed.

In [32]:
### Same as official solution

coder_survey['JobRoleInterest'].isnull().value_counts()

True     11183
False     6992
Name: JobRoleInterest, dtype: int64

In [33]:
### Same as official solution

coder_survey['JobRoleInterest'].dropna(axis = 0, inplace=True)

In [38]:
coder_survey['JobRoleInterest'].isnull().sum() # hago dropna de la columna pero veo que no hacer nada.

0

In [35]:
coder_survey['JobRoleInterest'] = coder_survey['JobRoleInterest'].astype(str)
coder_survey['JobRoleInterest'].replace(['nan'], np.nan, inplace=True)
coder_survey.dropna(subset=['JobRoleInterest'], inplace=True)

In [37]:
### Same as official solution

splitted_interests = coder_survey['JobRoleInterest'].str.split(',') 
splitted_interests.head(5)

1                           [Full-Stack Web Developer]
2    [  Front-End Web Developer,  Back-End Web Deve...
3    [  Front-End Web Developer,  Full-Stack Web De...
4    [Full-Stack Web Developer,  Information Securi...
6                           [Full-Stack Web Developer]
Name: JobRoleInterest, dtype: object

In [39]:
## My solution:

### Transform each element of a list-like to a row.

serie_exploded = coder_survey['JobRoleInterest'].explode()
serie_exploded.head(5)

1                             Full-Stack Web Developer
2      Front-End Web Developer, Back-End Web Develo...
3      Front-End Web Developer, Full-Stack Web Deve...
4    Full-Stack Web Developer, Information Security...
6                             Full-Stack Web Developer
Name: JobRoleInterest, dtype: object

In [40]:
### compruebo que no se me aparezcan missing values

coder_survey['JobRoleInterest'].isnull().sum()

0

In [41]:
## Split strings around given separator/delimiter.


serie_splited = serie_exploded.str.split(',')
serie_splited

1                               [Full-Stack Web Developer]
2        [  Front-End Web Developer,  Back-End Web Deve...
3        [  Front-End Web Developer,  Full-Stack Web De...
4        [Full-Stack Web Developer,  Information Securi...
6                               [Full-Stack Web Developer]
                               ...                        
18161                           [Full-Stack Web Developer]
18162    [  Data Scientist,  Game Developer,    Quality...
18163    [Back-End Web Developer,  Data Engineer,    Da...
18171    [  DevOps / SysAdmin,    Mobile Developer,    ...
18174    [Back-End Web Developer,  Data Engineer,    Da...
Name: JobRoleInterest, Length: 6992, dtype: object

In [42]:
nueva_serie = serie_splited.explode()
nueva_serie.value_counts()

 Full-Stack Web Developer           2490
   Front-End Web Developer          2287
 Back-End Web Developer             1997
   Mobile Developer                 1734
Full-Stack Web Developer            1708
                                    ... 
Pharmacy tech                          1
data journalist / data visualist       1
Desings                                1
 Infrastructure Architect              1
 IT specialist                         1
Name: JobRoleInterest, Length: 236, dtype: int64

In [43]:
new_serie = nueva_serie.str.strip() # remove 'white spaces' porque veo que en la celda anterior el texto está desplazado
new_serie.value_counts()

Full-Stack Web Developer            4198
Front-End Web Developer             3533
Back-End Web Developer              2772
Mobile Developer                    2304
Data Scientist                      1643
                                    ... 
Pharmacy tech                          1
data journalist / data visualist       1
Desings                                1
Infrastructure Architect               1
IT specialist                          1
Name: JobRoleInterest, Length: 208, dtype: int64

Comparacion entre la version oficial y la mia

In [None]:
official_value_counts = serie_exploded.value_counts()
mine_value_counts = new_serie.value_counts()

In [None]:
official = official_value_counts.reset_index()
mine = mine_value_counts.reset_index() # my solution

In [None]:
official

In [None]:
mine

In [None]:
comparacion = official.join(mine, lsuffix='_official', rsuffix='_mine')

comparacion_sin_nan = comparacion # .dropna()

comparacion_sin_nan = comparacion_sin_nan.reset_index(drop=True)
comparacion_sin_nan.head(20)

Hasta aqui la demostración de que la version oficial y la mia son diferentes, ahora viene el como se seleccionan los cursos entre `'Web Developer|Mobile Developer'` sin tener en cuenta los posibles errores tipograficos, cosa que como se verá en adelante ocurre.

In [None]:
#serie_exploded = coder_survey['JobRoleInterest'].explode()
#serie_splited = serie_exploded.str.split(',')
#nueva_serie = serie_splited.explode()
#new_serie = nueva_serie.str.strip() # remove 'white spaces'
new_serie.value_counts(normalize=True) * 100

In [None]:
serie_exploded.value_counts(normalize=True) * 100

In [None]:
mine_value_counts[mine_value_counts.index.str.contains('[M-m]obile')]

In [None]:
mine_value_counts[mine_value_counts.index.str.contains('[W-w]eb')]

In [None]:
official_value_counts[official_value_counts.index.str.contains('[M-m]obile')]

<br>

It is not a surprise to find typographical errors derived from the nature of the survey. 

Now we know that we must correct.

In [None]:
coder_survey.shape

In [None]:
# Create a new dataframe with column 'col2' decomposed into 22600 individual elements
coder_survey['JobRoleInterest'].str.split(',').explode().reset_index(drop=True)

In [None]:
# Crear un nuevo dataframe con todas las filas repetidas por cada valor en la columna explotada
new_df = pd.DataFrame({
    col: np.repeat(coder_survey[col].values, coder_survey['JobRoleInterest'].str.split(',').apply(len))
    for col in coder_survey.columns if col != 'JobRoleInterest'
})

new_df['JobRoleInterest'] = coder_survey['JobRoleInterest'].str.split(',').explode().reset_index(drop=True)

In [None]:
new_df['JobRoleInterest'] = new_df['JobRoleInterest'].str.strip()

In [None]:
new_df['JobRoleInterest'].value_counts().head(10)

In [None]:
new_df.shape

In [None]:
# Movile Developer

replacement_mobile_dict = {"mobile":"Mobile",
                           "developer":"Developer",
                           "development":"Developer",
                           "Development":"Developer"}

for to_replace in replacement_mobile_dict:
    new_df['JobRoleInterest'] = new_df['JobRoleInterest'].str.replace(to_replace,replacement_mobile_dict[to_replace])

In [None]:
new_df['JobRoleInterest'].isna().sum()

In [None]:
# Full Stack Web Developer / Front End Web Developer / Back End Web Developer Web Designer
# Front End Web Designer / Front End Web Developer / Front End Web Designer

replacement_web_dict = {"Full-Stack Web Developer":"Full Stack Web Developer",
                        "Front-End Web Developer":"Front End Web Developer",
                        "Back-End Web Developer":"Back End Web Developer",
                        "Web Design":"Web Designer",
                        "Front-End Web Designer":"Front End Web Designer",
                        "Software Developer or Front-End Web Developer":"Front End Web Developer",
                        "Front-End Web Designer":"Front End Web Designer"}

for to_replace in replacement_web_dict:
    new_df['JobRoleInterest'] = new_df['JobRoleInterest'].str.replace(to_replace,replacement_web_dict[to_replace])

In [None]:
coder_survey['JobRoleInterest'].isna().sum()

In [None]:
new_df['JobRoleInterest'] = new_df['JobRoleInterest'].str.strip()
new_df['JobRoleInterest'].value_counts().head(10)

In [None]:
mask_ndf = new_df['JobRoleInterest'].str.contains('[W-w]eb|[M-m]obile') # returns an array of booleans
freq_table = mask_ndf.value_counts(normalize = True) * 100
print(freq_table)

In [None]:
web_or_mobile = new_df['JobRoleInterest'].str.contains('Web Developer|Mobile Developer') # returns an array of booleans
freq_table = web_or_mobile.value_counts(normalize = True) * 100
print(freq_table)

In [None]:
new_df.loc[mask_ndf,"JobRoleInterest"].value_counts().sum()

In [None]:
maskofficial = coder_survey['JobRoleInterest'].str.contains('[W-w]eb|[M-m]obile') # returns an array of booleans
freq_table = mask_ndf.value_counts(normalize = True) * 100
print(freq_table)

In [None]:
coder_survey.shape

In [None]:
new_df.shape

In [None]:
import matplotlib.pyplot as plt

# Gráfico de barras para la Tabla 1
plt.figure(figsize=(12, 6))  # Tamaño de la figura (opcional)

# Subplot 1: Tabla 2
plt.subplot(1, 2, 1)  # 1 fila, 2 columnas, segundo subplot
freq_table2.plot.bar(alpha=0.5, color="green")
plt.xlabel('[W-w]eb|[M-m]obile')
plt.ylabel('')
plt.title('Uncleaned Series Dataset 6992 rows')

# Subplot 2: Tabla 1
plt.subplot(1, 2, 2)  # 1 fila, 2 columnas, primer subplot
freq_table.plot.bar(alpha=0.5, color="red")
plt.xlabel('[W-w]eb|[M-m]obile')
plt.ylabel('Frecuencias')
plt.title('Clean Series and DataSet 22600 rows')

plt.tight_layout()  # Ajusta el espaciado entre subplots
plt.show()

In [None]:
# Isolate the participants that answered what role they'd be interested in
fcc_good = new_df[new_df['JobRoleInterest'].notnull()].copy()

# Frequency tables with absolute and relative frequencies
absolute_frequencies = fcc_good['CountryLive'].value_counts()
relative_frequencies = fcc_good['CountryLive'].value_counts(normalize = True) * 100

# Display the frequency tables in a more readable format
pd.DataFrame(data = {'Absolute frequency': absolute_frequencies, 
                     'Percentage': relative_frequencies}
            ).head(20)

In [None]:
pd.set_option("display.max_columns", None)
#new_df.head(20)

In [None]:
new_df['JobRoleInterest'].value_counts(normalize = True).head(10)[0:4]*100

In [None]:
seleccion_web = new_df['JobRoleInterest'].value_counts(normalize = True).head(10)[0:3].sum() * 100
seleccion_mobile = new_df['JobRoleInterest'].value_counts(normalize = True).head(10)[3:4].sum() * 100
seleccion_total = seleccion_web + seleccion_mobile

print(f"seleccion_web:    {seleccion_web} \nseleccion_mobile: {seleccion_mobile} \ntotal:            {seleccion_total}")

We can see that 56.68% percent of people are interested in web and mobile.


- Where are these new coders located.

In [None]:
# We select the indices of the output of value_counts that interest us

top_jobs = new_df['JobRoleInterest'].value_counts().head(10)[0:4]
top_jobs = top_jobs.index
top_jobs

To make we are working with a representative sample, drop all the rows where participants didn't answer what role they are interested in.

In [None]:
pd.set_option("display.max_rows", None)

In [None]:
new_df['JobRoleInterest'].isnull().sum()

The analysis focuses on the current location of `CountryLive` people and the importance of a greater number of potential customers in a market. To obtain a representative sample, it is advisable to eliminate the rows in which participants did not respond their interest in a specific role `JobRoleInterest`, since you can not know with certainty their interests.

Podemos continuar averiguando cuánto dinero están realmente dispuestos a gastar los nuevos codificadores en el aprendizaje. 

La publicidad dentro de los mercados donde la mayoría de las personas solo están dispuestas a aprender gratis es extremadamente improbable que sea rentable para nosotros. 

La columna `MoneyForLearning` describe en dólares estadounidenses la cantidad de dinero gastado por los participantes desde el momento en que comenzaron a codificar hasta el momento en que completaron la encuesta. 

Nuestra empresa vende suscripciones a un precio de $ 59 por mes, y por esta razón estamos interesados en saber:

- Cuánto dinero gasta cada estudiante por mes. 

- Parece una buena idea reducir nuestro análisis a solo cuatro países: Estados Unidos, India, Reino Unido y Canadá. 

Dos razones para esta decisión son: 

- Estos son los países que tienen las frecuencias absolutas más altas en nuestra muestra, lo que significa que tenemos una cantidad decente de datos para cada uno.

- Nuestros cursos están escritos en inglés, y el inglés es un idioma oficial en estos cuatro países. Cuanta más gente sepa inglés, mayores serán nuestras posibilidades de dirigirnos a las personas adecuadas con nuestros anuncios.



Cree una nueva columna que describa la cantidad de dinero que un estudiante ha gastado por mes (en el momento en que completó la encuesta). Deberá dividir la columna MoneyForLearning en la columna MonthsProgramming. Algunos estudiantes respondieron que habían estado aprendiendo a codificar durante 0 meses (podría ser que acababan de comenzar cuando completaron la encuesta). Para evitar dividir por 0, reemplace todos los valores de 0 con 1.

In [None]:
fcc_good['MonthsProgramming'].replace(0,1, inplace = True)

In [None]:
fcc_good['MonthsProgramming'].isnull().sum()

In [None]:
new_df['MoneyForLearning'].isnull().sum()

In [None]:
# New column for the amount of money each student spends each month
new_df['money_per_month'] = new_df['MoneyForLearning'] / new_df['MonthsProgramming']
new_df['money_per_month'].isnull().sum()

In [None]:
# Keep only the rows with non-nulls in the `money_per_month` column 
new_df = new_df[new_df['money_per_month'].notnull()]

We want to group the data by country, and then measure the average amount of money that students spend per month in each country. First, let's remove the rows having null values for the CountryLive column, and check out if we still have enough data for the four countries that interest us.

In [None]:
# Remove the rows with null values in 'CountryLive'
new_df = new_df[new_df['CountryLive'].notnull()]

# Frequency table to check if we still have enough data
new_df['CountryLive'].value_counts().head()

In [None]:
#new_df = df_not_nan.copy()
#new_df.loc[:,'money_spent_monthly'] = new_df['MoneyForLearning'].div(new_df['MonthsProgramming'], fill_value=1)

The results for the United Kingdom and Canada are a bit surprising relative to the values we see for India. If we considered a few socio-economical metrics (like GDP per capita), we'd intuitively expect people in the UK and Canada to spend more on learning than people in India.

It might be that we don't have have enough representative data for the United Kingdom and Canada, or we have some outliers (maybe coming from wrong survey answers) making the mean too large for India, or too low for the UK and Canada. Or it might be that the results are correct.

## Dealing with Extreme Outliers

Let's use box plots to visualize the distribution of the money_per_month variable for each country.

In [None]:
# Isolate only the countries of interest
only_4 = new_df[new_df['CountryLive'].str.contains(
    'United States of America|India|United Kingdom|Canada')]

# Box plots to visualize distributions
import seaborn as sns
sns.boxplot(y = 'money_per_month', x = 'CountryLive',
            data = only_4)
plt.title('Money Spent Per Month Per Country\n(Distributions)',
         fontsize = 16)
plt.ylabel('Money per month (US dollars)')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada']) # avoids tick labels overlap
plt.show()

t's hard to see on the plot above if there's anything wrong with the data for the United Kingdom, India, or Canada, but we can see immediately that there's something really off for the US: two persons spend each month 
20,000 per month.

In [None]:
# Isolate only those participants who spend less than 10000 per month
new_df = new_df[new_df['money_per_month'] < 20000]

Now let's recompute the mean values and plot the box plots again.



In [None]:
# Recompute mean sum of money spent by students each month
countries_mean = new_df.groupby('CountryLive').mean()
countries_mean['money_per_month'][['United States of America',
                            'India', 'United Kingdom',
                            'Canada']]

In [None]:
# Isolate again the countries of interest
only_4 = new_df[new_df['CountryLive'].str.contains(
    'United States of America|India|United Kingdom|Canada')]

# Box plots to visualize distributions
sns.boxplot(y = 'money_per_month', x = 'CountryLive',
            data = only_4)
plt.title('Money Spent Per Month Per Country\n(Distributions)',
         fontsize = 16)
plt.ylabel('Money per month (US dollars)')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada']) # avoids tick labels overlap
plt.show()

We can see a few extreme outliers for India (values over $2500 per month), but it's unclear whether this is good data or not. Maybe these persons attended several bootcamps, which tend to be very expensive. Let's examine these two data points to see if we can find anything relevant.

In [None]:
# Inspect the extreme outliers for India
india_outliers = only_4[
    (only_4['CountryLive'] == 'India') & 
    (only_4['money_per_month'] >= 2500)]
india_outliers.tail(10)

It seems that neither participant attended a bootcamp. Overall, it's really hard to figure out from the data whether these persons really spent that much money with learning. The actual question of the survey was "Aside from university tuition, about how much money have you spent on learning to code so far (in US dollars)?", so they might have misunderstood and thought university tuition is included. It seems safer to remove these two rows.

In [None]:
# Remove the outliers for India
only_4 = only_4.drop(india_outliers.index) # using the row labels

Looking back at the box plot above, we can also see more extreme outliers for the US (values over $6000 per month). Let's examine these participants in more detail.

In [None]:
# Examine the extreme outliers for the US
us_outliers = only_4[
    (only_4['CountryLive'] == 'United States of America') & 
    (only_4['money_per_month'] >= 6000)]

us_outliers.tail(10)

Out of these 11 extreme outliers, six people attended bootcamps, which justify the large sums of money spent on learning. For the other five, it's hard to figure out from the data where they could have spent that much money on learning. Consequently, we'll remove those rows where participants reported thay they spend $6000 each month, but they have never attended a bootcamp.

Also, the data shows that eight respondents had been programming for no more than three months when they completed the survey. They most likely paid a large sum of money for a bootcamp that was going to last for several months, so the amount of money spent per month is unrealistic and should be significantly lower (because they probably didn't spend anything for the next couple of months after the survey). As a consequence, we'll remove every these eight outliers.

In the next code block, we'll remove respondents that:

Didn't attend bootcamps.
Had been programming for three months or less when at the time they completed the survey.

In [None]:
# Remove the respondents who didn't attendent a bootcamp
no_bootcamp = only_4[
    (only_4['CountryLive'] == 'United States of America') & 
    (only_4['money_per_month'] >= 6000) &
    (only_4['AttendedBootcamp'] == 0)
]

only_4 = only_4.drop(no_bootcamp.index)


# Remove the respondents that had been programming for less than 3 months
less_than_3_months = only_4[
    (only_4['CountryLive'] == 'United States of America') & 
    (only_4['money_per_month'] >= 6000) &
    (only_4['MonthsProgramming'] <= 3)
]

only_4 = only_4.drop(less_than_3_months.index)

Looking again at the last box plot above, we can also see an extreme outlier for Canada — a person who spends roughly $5000 per month. Let's examine this person in more depth.

In [None]:
# Examine the extreme outliers for Canada
canada_outliers = only_4[
    (only_4['CountryLive'] == 'Canada') & 
    (only_4['money_per_month'] > 4500)]

canada_outliers.tail()

Here, the situation is similar to some of the US respondents — this participant had been programming for no more than two months when he completed the survey. He seems to have paid a large sum of money in the beginning to enroll in a bootcamp, and then he probably didn't spend anything for the next couple of months after the survey. We'll take the same approach here as for the US and remove this outlier.

In [None]:
# Remove the extreme outliers for Canada
only_4 = only_4.drop(canada_outliers.index)

Let's recompute the mean values and generate the final box plots.

In [None]:
# Recompute mean sum of money spent by students each month
only_4.groupby('CountryLive').mean()['money_per_month']

In [None]:
# Visualize the distributions again
sns.boxplot(y = 'money_per_month', x = 'CountryLive',
            data = only_4)
plt.title('Money Spent Per Month Per Country\n(Distributions)',
          fontsize = 16)
plt.ylabel('Money per month (US dollars)')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'UK', 'India', 'Canada']) # avoids tick labels overlap
plt.show()

## Choosing the Two Best Markets

Obviously, one country we should advertise in is the US. Lots of new coders live there and they are willing to pay a good amount of money each month (roughly $143).

We sell subscriptions at a price of 
93 per month, compared to India (
45).

The data suggests strongly that we shouldn't advertise in the UK, but let's take a second look at India before deciding to choose Canada as our second best choice:

66 each month.
We have almost twice as more potential customers in India than we have in Canada:

In [None]:
# Frequency table for the 'CountryLive' column
only_4['CountryLive'].value_counts(normalize = True) * 100

So it's not crystal clear what to choose between Canada and India. Although it seems more tempting to choose Canada, there are good chances that India might actually be a better choice because of the large number of potential customers.

At this point, it seems that we have several options:

Advertise in the US, India, and Canada by splitting the advertisement budget in various combinations:

60% for the US, 25% for India, 15% for Canada.
50% for the US, 30% for India, 20% for Canada; etc.
Advertise only in the US and India, or the US and Canada. Again, it makes sense to split the advertisement budget unequally. For instance:

70% for the US, and 30% for India.
65% for the US, and 35% for Canada; etc.
Advertise only in the US.

At this point, it's probably best to send our analysis to the marketing team and let them use their domain knowledge to decide. They might want to do some extra surveys in India and Canada and then get back to us for analyzing the new survey data.

## Conclusion
In this project, we analyzed survey data from new coders to find the best two markets to advertise in. The only solid conclusion we reached is that the US would be a good market to advertise in.

For the second best market, it wasn't clear-cut what to choose between India and Canada. We decided to send the results to the marketing team so they can use their domain knowledge to take the best decision.