In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial


Read in the job descriptions from Glassdoor and the profile from LinkedIn, then drop job rows with missing values.


In [2]:
df = pd.read_csv('../data/dsjobs_training_culled.csv', index_col=0)
df = df.dropna()

In [3]:
profile_vector = pd.read_csv('../data/profile_vector.csv', index_col=0)


Create a single Series that holds the profile as its first observation, followed by the job postings.


In [4]:
# Series.append was removed in pandas 2.0; pd.concat gives the same result
full_df = pd.concat([profile_vector['profile'], df['jobs']])

Fitting TfidfVectorizer on the whole corpus and creating a dataframe with the results.

In [5]:
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(full_df)
transformed_model = vectorizer.transform(full_df)
tfidf_df = pd.DataFrame(transformed_model.toarray())

In [7]:
profile = tfidf_df.iloc[0, :]

Calculate the cosine distance between the LinkedIn profile and each job posting. Row 0 is the profile itself, so its distance is 0.

In [19]:
distances = []
for i in range(len(tfidf_df.index)):
    distances.append(spatial.distance.cosine(profile, tfidf_df.iloc[i,:]))
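
The loop above calls SciPy once per row on a dense dataframe. As a possible alternative, the sketch below computes every distance in one call on the sparse matrix, assuming scikit-learn's `cosine_distances` is an acceptable substitute for `spatial.distance.cosine`. The toy corpus and the `toy_*` names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical toy corpus: row 0 plays the role of the profile.
toy_corpus = [
    "data scientist python machine learning",
    "python machine learning engineer",
    "graphic designer illustration branding",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_corpus)

# Distance from row 0 to every row, computed without densifying the matrix.
# The slice 0:1 keeps the profile as a 2-D, single-row matrix.
toy_distances = cosine_distances(toy_matrix[0:1], toy_matrix).ravel()

# Row positions ordered from most to least similar to the profile.
toy_order = np.argsort(toy_distances)
```

Documents that share no terms with the profile end up at distance 1, which is the largest value TF-IDF vectors can produce because all weights are non-negative.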

Sort the distances from closest to furthest.

In [20]:
sorted_distances = np.sort(distances)

In [24]:
sorted_distances[:6]

array([0.        , 0.80989248, 0.84734903, 0.8862868 , 0.89611812,
       0.89793871])

Argsort the distances to get the row indices in sorted order.

In [21]:
indices = np.argsort(distances)

In [23]:
indices[:6]

array([  0,  14,  98, 113, 103,  89])
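
The sorted distances and the argsort indices describe the same ordering, so one can be derived from the other. A small sketch with made-up values (`example_distances` is hypothetical, not from the job data):

```python
import numpy as np

example_distances = [0.0, 0.9, 0.3, 0.6]  # made-up values

# Positions of the elements in ascending order of distance.
order = np.argsort(example_distances)

# Indexing with the argsort result reproduces np.sort.
ranked = np.asarray(example_distances)[order]
```

Here `ranked` matches `np.sort(example_distances)`, so the separate sort call is optional once the indices are available.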

In [64]:
indices_df = pd.DataFrame({'indices': indices})

In [75]:
sorted_distances_df = pd.DataFrame({'distances': sorted_distances})

Sort the full dataframe by the sorted indices to get descriptions in order of most similar to least similar

In [25]:
sorted_df = pd.DataFrame({'jobs': full_df.iloc[indices]}).reset_index(drop=True)

Create a 'labels' column of zeros, to be set to 1.0 for a "yes" or -1.0 for a "no".

In [27]:
sorted_df['labels'] = np.zeros(120)

Create a test dataframe to experiment with user-issued labels.

In [46]:
sorted_tfidf = tfidf_df.iloc[indices].reset_index(drop=True)

In [113]:
total_df = pd.concat([indices_df, sorted_df, sorted_distances_df, sorted_tfidf], axis=1)

In [114]:
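# Row 0 is the profile itself; give it a sentinel label so it is never treated as an unlabeled posting
total_df.iat[0,2] = 10.0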

In [814]:
test_df = total_df.copy()

In [815]:
test_df.head()

Unnamed: 0,indices,jobs,labels,distances,0,1,2,3,4,5,...,4684,4685,4686,4687,4688,4689,4690,4691,4692,4693
0,0,Data Scientist Greater Seattle Area Data Scien...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.117186,0.0,0.0,0.0
1,14,Are you interested in working for one of the m...,0.0,0.809892,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,98,The Microsoft Cloud+AI Design team is looking ...,0.0,0.847349,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,113,"Lead Data Scientist\n\nSeattle, WA\n\nJob Desc...",0.0,0.886287,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,103,Overall Job Purpose:\n\nThis role will be loca...,0.0,0.896118,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [816]:
class yes():
    def __init__(self, df):
        self.df = df
    
    def prompt_user(self, df, index):
        recommended_posting = pd.DataFrame(df.iloc[index, :]).T
        # Take the scalar explicitly; calling int() on a one-element Series is deprecated
        original_index = int(recommended_posting['indices'].iloc[0])
        label_index = self.df[self.df['indices'] == original_index].index[0]
        print(recommended_posting.iloc[0,1])
        label = input("\nyes/no (quit)")
        if label == 'quit':
            return "Done"
        if label == 'yes':
            self.df.iat[label_index,2] = 1.0
        if label == 'no':
            self.df.iat[label_index,2] = -1.0
        return self.df
    
    def improve_yes(self, df):
        yes_df = pd.DataFrame(df[df['labels'] == 1.0])
        yes_remains = pd.DataFrame(df[df['labels'] == 0.0])
        yes_remains = yes_remains.set_index(np.arange(0, yes_remains.shape[0]))
        distances_from_yes = np.zeros((yes_remains.shape[0], yes_df.shape[0]))
        for i in range(yes_df.shape[0]):
            for j in range(yes_remains.shape[0]):
                # Compare against the i-th 'yes' posting, not always the first one
                distances_from_yes[j,i] = (spatial.distance.cosine(yes_df.iloc[i,4:], yes_remains.iloc[j,4:]))
        yes_remains['distances_from_yes'] = np.sum(distances_from_yes, axis=1)
        return yes_remains
    
    def improve_no(self, df):
        no_df = pd.DataFrame(df[df['labels'] == -1.0])
        no_remains = pd.DataFrame(df[df['labels'] == 0.0])
        no_remains = no_remains.set_index(np.arange(0, no_remains.shape[0]))
        distances_from_no = np.zeros((no_remains.shape[0], no_df.shape[0]))
        for i in range(no_df.shape[0]):
            for j in range(no_remains.shape[0]):
                # Compare against the i-th 'no' posting, not always the first one
                distances_from_no[j,i] = (spatial.distance.cosine(no_df.iloc[i,4:], no_remains.iloc[j,4:]))
        no_remains['distances_from_no'] = np.sum(distances_from_no, axis=1)
        return no_remains
    
    def find_next_best(self, df, index):
        adjusted_df = self.prompt_user(df, index)
        # prompt_user returns the string "Done" when the user types 'quit'
        if isinstance(adjusted_df, str):
            return self.df
        yes_remains = self.improve_yes(adjusted_df)
        no_remains = self.improve_no(adjusted_df)
        total_distance = yes_remains.distances + yes_remains.distances_from_yes - no_remains.distances_from_no
        remains = yes_remains.copy()
        remains['total_distance'] = total_distance
        next_best_index = remains.total_distance.idxmin()
        return self.find_next_best(remains, next_best_index)
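
The score used in `find_next_best` adds each unlabeled posting's distance to the profile and its summed distance to the "yes" postings, then subtracts its summed distance to the "no" postings. The posting with the lowest score is shown next. A minimal sketch of that scoring with SciPy's `cdist`, assuming each labeled posting contributes its own distance; `score_remaining` and the toy vectors are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist


def score_remaining(profile_dist, remaining, liked, disliked):
    """Return one score per unlabeled row; lower means a better candidate.

    Hypothetical helper for illustration, not taken from the notebook.
    """
    n = remaining.shape[0]
    # Summed cosine distance to every 'yes' posting (zeros if there are none yet).
    from_yes = (cdist(remaining, liked, metric="cosine").sum(axis=1)
                if len(liked) else np.zeros(n))
    # Summed cosine distance to every 'no' posting (zeros if there are none yet).
    from_no = (cdist(remaining, disliked, metric="cosine").sum(axis=1)
               if len(disliked) else np.zeros(n))
    return profile_dist + from_yes - from_no


# Made-up 2-D vectors standing in for TF-IDF rows.
remaining = np.array([[1.0, 0.0], [0.0, 1.0]])
liked = np.array([[1.0, 0.0]])
disliked = np.array([[0.0, 1.0]])
profile_dist = np.array([0.5, 0.5])

scores = score_remaining(profile_dist, remaining, liked, disliked)
next_best = int(np.argmin(scores))
```

In this toy case the first row matches the liked vector and is orthogonal to the disliked one, so it receives the lower score and would be shown next.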
    

In [817]:
test = yes(test_df)

In [818]:
final_df = test.find_next_best(test_df, 1)

Are you interested in working for one of the most exciting products in Microsoft, passionate about exceeding customer expectations and advancing Microsoft's cloud first strategy? Are you interested in a start-up like environment, excited about cloud computing technology and driving growth in one of Microsoft's core businesses? If so, then look no further than the Azure Customer Experience (CXP) Team!

Microsoft Azure provides customers with an on-demand and infinitely scalable infrastructure and platform for customers to build, host, and scale service applications on the Internet through Microsoft’s global data centers. As part of the Azure Engineering organization, Azure CXP is a rapidly growing team committed to driving Azure growth through our relentless pursuit of satisfied Azure customers, by leading world-class customer reliability engagements, engineering modern customer-first experiences for scale, and by driving deep customer insights and empathy into the broader Azure Enginee


yes/no (quit) yes


The Microsoft Cloud+AI Design team is looking for a Senior Data Scientist to join our Experience Analytics team - we work hard, have fun, and value collaboration and individuality in each other. As a team, we're passionate about maximizing the impact of Data Science work on Design and Product decisions, and we are leading discussions in this area within Microsoft and beyond.

Join us if you want to impact how millions of users use the cloud and all the services it enables. Our customers range from people with highly technical skills to information workers working with cloud enabled devices and services. Our product portfolio includes Dynamics, Azure, Power BI, PowerApps, Flow, and more.

The mission of the Experience Analytics team is to leverage product telemetry to unpack the nuance behind our customers' end-to-end journeys and drive design decisions. As a Data Scientist on the Experience Analytics team, you'll answer key questions about users, their in-product workflow, and the qual


yes/no (quit) no


Minimum of 5 years of working experience.
Experience in a technical/international organization is an advantage.
You are fluent in both written and spoken English.
You have knowledge of SQL Server databases and SQL.
You master programming in R or Python.
You have broad experience in applied statistics.
Knowledge of C# is an advantage.
Knowledge of Microsoft Azure is an advantage (Azure Machine Learning, Stream Analytics, Azure functions)
You have experience with reporting tools like PowerBI.
IoT knowledge is an advantage.
Experis is an Equal Opportunity Employer (EOE/AA) - provided by Dice

machine learning,Azure,Python,R,Iot,C#,data science



yes/no (quit) yes


Principal Data Scientist

Customer Success and Support | Seattle, Washington

Position Summary

The Principal Data Scientist - Customer Success Analytics is responsible designing and building models & algorithms that power the next generation of actionable insights for the Customer Success organization at DocuSign.

The ideal candidate has a rich experience across the domains of Customer Success, Professional Services, Customer Support and Sales Lifecycle as a data scientist / decision scientist. You will leverage big data, statistical analysis, machine learning, AI and other advanced techniques to help DocuSign drive customer success, retention, upsell and cross-sell. You will partner closely with other data scientists and analysts in product engineering, marketing, sales and finance. This role will influence and shape the design, architecture and roadmap for advanced predictive and prescriptive analytics while collaborating & partnering with customer success teams on business & custo


yes/no (quit) no


Senior Data Scientist or Data Scientist #88345
JOB SUMMARY: Puget Sound Energys Strategic Customer Insights Group is looking for a Data Scientist to join our team. This role works in a team that is designed to attain the companys strategic goals via the extensive use of data and analysis to drive decisions and actions. The ideal candidate will be responsible for the continuous improvement of PSEs competitiveness as an energy partner in the northwest through an analytical focus on customer experience.

At the discretion of the hiring team, this position may be filled as a senior data scientist or data scientist depending on the qualifications of the selected candidate.

Families and businesses depend on PSE to provide the energy they need to pursue their dreams. Our steadfast commitment to serving Washington communities with safe, dependable and efficient energy started in 1886. Today were building the Northwests energy future through efforts like our award winning energy efficiency pro


yes/no (quit) yes


Job Description
The Amazon Demand Forecasting team seeks a Data Scientist with strong analytical and communication skills to join our team. We develop sophisticated algorithms that involve learning from large amounts of data, such as prices, promotions, similar products, and a product’s attributes, in order to forecast the demand of over 190 million products world-wide. These forecasts are used to automatically order more than $200 million worth of inventory weekly, establish labor plans for tens of thousands of employees, and predict the company’s financial performance. The work is complex and important to Amazon. With better forecasts we drive down supply chain costs, enabling the offer of lower prices and better in-stock selection for our customers.

In a typical day, you will work closely with talented machine learning scientists, statisticians, software engineers, and business groups. Your work will include cutting edge technologies that enable implementation of sophisticated mode


yes/no (quit) quit


In [820]:
final_df.shape

(120, 4698)