# Creating Features Through Investigation

On this page, we focus on creating new features beyond the columns already given in the dataset.
Let's start with a simple hypothesis and investigate whether it can be turned into a new feature.

```{note}
This page corresponds to the "Power Feature" part of the original project. Here I revise that part, fix some of its errors, and add new investigations for more features.
```

### First Investigation: Relationship Between Location and Fraudulent Posting

The first hypothesis we will investigate is that **location, especially the state, is related to whether a posting is fraudulent.** For example, many fake job postings might come from California or New York, since those states have large populations and therefore more job openings than others. If location is significantly related to the `fraudulent` variable, we can extract the state from the location and use it as a feature. Let's investigate the relationship and see whether the hypothesis holds.
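To make "related" concrete: if the hypothesis holds, the share of fraudulent postings should differ noticeably from state to state. The sketch below shows that check on a handful of made-up rows, not on the real dataset; the function name `fraud_rate_by_state` and the toy values are illustrative assumptions.

```python
from collections import defaultdict

def fraud_rate_by_state(states, labels):
    """Return {state: share of postings labeled fraudulent (1)}."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for state, label in zip(states, labels):
        totals[state] += 1
        frauds[state] += label
    return {state: frauds[state] / totals[state] for state in totals}

# Toy data for illustration only; these are not values from train_set.csv.
toy_states = ["CA", "CA", "NY", "NY", "TX", "TX", "TX", "CA"]
toy_labels = [1, 0, 0, 0, 1, 1, 0, 0]

rates = fraud_rate_by_state(toy_states, toy_labels)
for state, rate in sorted(rates.items()):
    print(f"{state}: {rate:.2f}")
```

Roughly equal rates across states would suggest the state carries little signal. Large gaps would support the hypothesis, although calling the relationship significant would still require a formal test.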

In [33]:
import pandas as pd 
import numpy as np
import re

In [85]:
train_data = pd.read_csv("./data/train_set.csv")
data = train_data[["location", "fraudulent"]]
data.head()

| | location | fraudulent |
|---|---|---|
| 0 | US, VA, Virginia Beach | 0 |
| 1 | US, TX, Dallas | 0 |
| 2 | NZ, , Auckland | 0 |
| 3 | US, NE, Omaha | 0 |
| 4 | US, CA, Los Angeles | 0 |


In [86]:
target = data["location"].fillna("No Location")

In [87]:
# fillna returns a new Series; assign() writes it back on a new DataFrame,
# so no in-place edit of a slice (and no SettingWithCopyWarning) is involved.
data = data.assign(location=target)
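Selecting columns from a DataFrame and then filling values in place on the result is the pattern pandas flags with a `SettingWithCopyWarning`. The sketch below shows one common way to avoid it, using a small made-up frame in place of `train_data`; the names `toy` and `subset` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; values are illustrative only.
toy = pd.DataFrame({
    "location": ["US, VA, Virginia Beach", np.nan, "NZ, , Auckland"],
    "fraudulent": [0, 1, 0],
    "title": ["a", "b", "c"],
})

# .copy() makes the subset an independent DataFrame rather than a possible view.
subset = toy[["location", "fraudulent"]].copy()

# Assign the filled column back instead of calling fillna(..., inplace=True)
# on a Series pulled out of the subset.
subset["location"] = subset["location"].fillna("No Location")

print(subset["location"].tolist())
```

Because `subset` is an explicit copy, the original `toy` frame keeps its missing value, which makes it clear which object was modified.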


In [89]:
len(target)

14304

To extract the state from the location, let's define a helper function.

In [116]:
def extract_state(s):
    """Extract the state from each location string.

    Only works when the state is formatted as two capital letters.
    Input: Series or other iterable of location strings.
    Output: list of states.
    """
    result = []
    for loc in s:
        if "US" in loc:
            # Caution: r'[US]' is a character class, so this strips every 'U' and 'S'
            # (not only the "US" prefix) before searching for two-capital-letter codes.
            extracted = re.findall(r'[A-Z]{2}', re.sub(r'[US]', '', loc))
            if not extracted:
                extracted = ["Domestic"]
            # Caution: findall can return several matches, so a single location
            # may add more than one entry to the result.
            result += extracted
        else:
            if loc == "No Location":
                result.append(loc)
            elif re.findall(r'[A-Z]{2}', loc):
                result.append("Foreign")
            else:
                result.append("No Location")
    return result

In [117]:
# Check row 22 by hand: no two-letter code remains, so the function labels it "Domestic".
re.findall(r'[A-Z]{2}', re.sub(r'[US]','',data["location"][22]))

[]

In [118]:
extract_state(data["location"][0:23])

['VA',
 'TX',
 'Foreign',
 'NE',
 'CA',
 'NY',
 'Foreign',
 'OH',
 'Foreign',
 'MA',
 'CA',
 'TN',
 'Foreign',
 'TX',
 'Foreign',
 'Foreign',
 'CO',
 'TX',
 'CA',
 'Foreign',
 'Foreign',
 'Foreign',
 'Domestic']

In [132]:
len(extract_state(data["location"][0:185]))

185

In [91]:
len(extract_state(data["location"]))

14133

In [125]:
len(extract_state(data["location"]))

14717
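Neither full-column result (14133 or 14717) matches the 14304 rows reported by `len(target)`, so the returned list cannot be attached to the data as a column. With `extract_state` as written, at least two things can skew the count or the labels: `re.findall` returns every two-capital-letter match and `result += extracted` adds all of them, and `re.sub(r'[US]', ...)` removes individual `U` and `S` characters, so a code such as `SC` loses a letter. Below is a minimal sketch of an alternative that emits exactly one label per row. It assumes every location follows the `Country, State, City` layout seen in `data.head()`; the name `extract_state_by_field` and the sample strings are hypothetical.

```python
import re

STATE_CODE = re.compile(r"[A-Z]{2}")

def extract_state_by_field(locations):
    """Return exactly one label per location string."""
    labels = []
    for loc in locations:
        # Split "Country, State, City" into stripped fields.
        fields = [field.strip() for field in loc.split(",")]
        country = fields[0]
        state = fields[1] if len(fields) > 1 else ""
        if country == "US":
            # Keep the state only if it is exactly two capital letters.
            labels.append(state if STATE_CODE.fullmatch(state) else "Domestic")
        elif STATE_CODE.fullmatch(country):
            labels.append("Foreign")
        else:
            labels.append("No Location")
    return labels

# Illustrative inputs, not rows copied from the dataset.
sample = [
    "US, VA, Virginia Beach",
    "NZ, , Auckland",
    "US, SC, Charleston",
    "US, , ",
    "No Location",
]
labels = extract_state_by_field(sample)
print(labels)
print(len(labels) == len(sample))
```

Reading the state from its own comma-separated field, rather than searching the whole string, keeps city names and the country code from being mistaken for a state.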

In [58]:
data["location"][0]

'US, VA, Virginia Beach'