# Classifying doctor violations by hand, then by machine

You're looking for a certain type of doctor violation, whether that's keeping poor records, being addicted to drugs, or anything else. **You decide.**

**You're going to see how often doctors lose their license for that violation.** There are about 7000 records, though, and you ain't going to read all of them!

Steps:

1. **Classify some violations by hand**
1. Vectorize the **hand-classified violations**
1. Train a classifier on the **hand-classified violations**.
1. **Test the classifier**. If it's good, next step! If not, go back to training.
1. Vectorize the **unclassified violations**
1. Use the classifier to **predict the labels of the unclassified violations**
1. What actions were taken against those doctors?

It'll be magic!

In [29]:
import pandas as pd
import numpy as np

In [30]:
df = pd.read_csv("physicians-ny-violations.csv")
df.head(2)

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,


## Step 1: Classify some by hand

If you had a CSV with some sort of key in common, you'd be able to just do a join. But we don't! So **I'm going to help you out**.

I wrote this little script to help you **classify content by hand**. It will print the violation, then ask whether it's what you're looking for. If you type "y" or "Y" before hitting enter, that means YES; anything else means NO. Once it's done it'll add the results to the dataframe in a column called `category`.
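If you'd like to poke at the labelling logic without typing answers, here's a minimal sketch. It assumes a hypothetical `label_rows` helper with an `ask` hook (neither is part of the notebook), and it runs on made-up toy rows. Only the `misconduct` and `category` column names come from the real data.

```python
import pandas as pd

def label_rows(df, n, ask=input):
    """Ask about the first n rows and return a Series of YES/NO labels.

    `ask` is a hypothetical hook: it defaults to input(), but any function
    that takes a prompt string and returns a string will work.
    """
    subset = df.head(n)
    labels = []
    for text in subset.misconduct:
        answer = ask("\n{}\n\nIS THIS WHAT YOU'RE LOOKING FOR? y for YES ".format(text))
        labels.append("YES" if answer.strip().lower() == "y" else "NO")
    # Keep the original index so the labels line up with the right rows
    return pd.Series(labels, index=subset.index)

# Toy rows standing in for the real CSV
toy = pd.DataFrame({"misconduct": [
    "ordering excessive tests",
    "failing to maintain accurate records",
    "insurance fraud",
]})

# Canned answers instead of typing: YES for the first row, NO for the second
answers = iter(["y", "n"])
toy["category"] = label_rows(toy, 2, ask=lambda prompt: next(answers))
print(toy)
```

The third row never gets asked about, so its `category` stays empty (NaN). That's how you tell the hand-classified rows from the rest later on.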

In [31]:
number_to_classify_by_hand = 2

In [32]:
def is_what_you_want(row):
    response = input("\n------------\n\n{desc}\n\n\nIS THIS WHAT YOU'RE LOOKING FOR? y for YES ".format(desc=row.misconduct))
    if response == "y" or response == "Y":
        print("\n** Classified as YES **")
        return "YES"
    else:
        print("\n** Classified as NO **")
        return "NO"

# Reset category column
df['category'] = np.nan
df['category'] = df[:number_to_classify_by_hand].apply(is_what_you_want, axis=1)

df.category.value_counts()


------------

The corporation admitted guilt to the charge of ordering excessive tests, treatment, or use of treatment facilities not warranted by the condition of a patient.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES y

** Classified as YES **

------------

The corporation admitted to the charge of having been convicted in New York Supreme Court, Kings County of a scheme to defraud in the first degree; falsifying business records; insurance fraud and failing to comply with the requirements of the New York State Business Corporation Law Section 1503(a).


IS THIS WHAT YOU'RE LOOKING FOR? y for YES y

** Classified as YES **


YES    2
Name: category, dtype: int64

## Step 2: Vectorize the violation descriptions

You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.**
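Here's a small sketch of picking out the rows you classified, using a made-up toy dataframe. One thing to watch: the comparison goes *outside* the square brackets. Writing `df['category' == 'YES']` compares two strings first, gets `False`, and then looks for a column named `False`.

```python
import numpy as np
import pandas as pd

# Toy rows: two labelled by hand, two still waiting
toy = pd.DataFrame({
    "misconduct": ["excessive tests", "poor records", "fraud", "negligence"],
    "category": ["YES", "NO", np.nan, np.nan],
})

# Rows you labelled have a category (YES or NO); the rest are still NaN
df_classified = toy[toy["category"].notnull()]
df_unclassified = toy[toy["category"].isnull()]

print(len(df_classified), "classified,", len(df_unclassified), "to go")
```

Filtering with `notnull()` keeps both the YES and the NO rows. The classifier needs examples of each.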

In [38]:
# Optional: shuffle the rows first
#df = df.sample(frac=1).reset_index(drop=True)

# Keep only the rows that were classified by hand (YES or NO)
df_classified = df[df['category'].notnull()]

In [34]:
df.category.value_counts()

YES    2
Name: category, dtype: int64

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words = 'english', max_features=3000)
matrix = vec.fit_transform(df['misconduct'].values.astype('U'))
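# get_feature_names() was removed in scikit-learn 1.2; before version 1.0, use it instead of get_feature_names_out()
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())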
features_df.head(10)

Unnamed: 0,00,000,01,02,028,03,04,05,06,061,...,wyoming,xanax,xenical,yacht,yag,year,years,yolo,yonkers,york
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Step 3: Create a classifier and train a model using the violation descriptions

You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.** You'll also need to make the `category` column a number, probably.

And remember your test/train split!
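As a rough sketch of those two chores (turning `category` into a number and splitting into train and test), here's a version on made-up toy data. The texts and labels are invented, and `random_state=0` is only there to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "misconduct": [
        "ordering excessive tests",
        "failing to maintain records",
        "excessive treatment not warranted",
        "practicing while impaired by alcohol",
        "billing for excessive tests",
        "inadequate patient records",
        "insurance fraud",
        "negligence on more than one occasion",
    ],
    "category": ["YES", "NO", "YES", "NO", "YES", "NO", np.nan, np.nan],
})

# Only the hand-classified rows can be used for training and testing
labelled = toy[toy.category.notnull()]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(labelled.misconduct)

# YES becomes 1, NO becomes 0
y = (labelled.category == "YES").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape[0], "training rows,", X_test.shape[0], "test rows")
```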

In [36]:
(df.category == 'YES').astype(int)

0       1
1       1
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
7110    0
7111    0
7112    0
7113    0
7114    0
7115    0
7116    0
7117    0
7118    0
7119    0
7120    0
7121    0
7122    0
7123    0
7124    0
7125    0
7126    0
7127    0
7128    0
7129    0
7130    0
7131    0
7132    0
7133    0
7134    0
7135    0
7136    0
7137    0
7138    0
7139    0
Name: category, Length: 7140, dtype: int64

In [42]:
df.head()

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,YES
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,YES
2,License Surrender,,01/13/1999,Joseph,Aaron,72800,MD,,This action modifies the penalty previously im...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1927.0,
3,License limited until the physician's North Ca...,12/06/2005,12/13/2005,Mark,Aarons,161530,MD,Gold,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,
4,License surrender.,08/07/2013,08/14/2013,Jamsheed,Abadi,136045,MD,S,The physician did not contest the charge of fa...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1939.0,


In [39]:
from sklearn.preprocessing import LabelEncoder

In [40]:
le = LabelEncoder()

In [41]:
# LabelEncoder can't sort a mix of strings and NaN, so only encode the hand-classified rows
classified = df.category.notnull()
df.loc[classified, 'label'] = le.fit_transform(df.loc[classified, 'category'])

## Step 4: Test the classifier

How does it look? Remember, we're only using the classified ones so far!

**If you don't like its predicting ability**, go back up and play around with your vectorizer, and even with your classifier. There are a lot of options!
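One way to "play around" is to fit a few classifiers on the same split and compare their scores. This is a sketch on invented toy data, not the notebook's results. With this few rows the scores mean very little, so treat it as a template for your own hand-classified set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Made-up descriptions: 1 = what we're looking for, 0 = not
texts = [
    "ordering excessive tests", "excessive treatment not warranted",
    "billing for excessive tests", "excessive use of treatment facilities",
    "ordering tests not warranted by the condition",
    "failing to maintain records", "inadequate patient records",
    "practicing while impaired by alcohol", "habitual use of narcotics",
    "failing to keep accurate records",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

X = CountVectorizer(stop_words="english").fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels)

# Fit each candidate on the training rows, score it on the held-out rows
scores = {}
for clf in [BernoulliNB(), MultinomialNB(), DecisionTreeClassifier(random_state=0)]:
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = clf.score(X_test, y_test)  # accuracy, 0 to 1

print(scores)
```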

In [43]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

In [44]:
clf = BernoulliNB()

In [46]:
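from sklearn.model_selection import train_test_split

# Only the hand-classified rows have a label, so only those go into the split
classified = df.category.notnull()
X_train, X_test, y_train, y_test = train_test_split(
    features_df[classified.values].values,
    (df.category[classified] == 'YES').astype(int),
    test_size=0.3)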

## Step 5: Vectorize the unclassified violations

Now we need to vectorize the violations we didn't classify by hand.

You **DO NOT MAKE A NEW VECTORIZER**. You just use the one we already have! Also, you **DON'T FIT IT AGAIN!** You just transform. I hope you read this line, but I'll give you some code anyway.

In [None]:
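not_categorized = df[df.category.isnull()]

# Transform only: reuse the fitted vectorizer, don't fit it again.
# The result is a sparse matrix, so wrap it in a dataframe before calling .head()
unclassified_matrix = vec.transform(not_categorized.misconduct.values.astype('U'))
unclassified_features_df = pd.DataFrame(unclassified_matrix.toarray(), columns=features_df.columns, index=not_categorized.index)
unclassified_features_df.head()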
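To see why you transform without fitting, here's a tiny sketch with made-up sentences. Fitting is what builds the list of columns, so fitting again would build a *different* list and the classifier would no longer know which column is which.

```python
from sklearn.feature_extraction.text import CountVectorizer

labelled_texts = ["ordering excessive tests", "failing to maintain accurate records"]
unlabelled_texts = ["excessive tests and yacht purchases"]

vec = CountVectorizer(stop_words="english")
train_matrix = vec.fit_transform(labelled_texts)  # learns the vocabulary
new_matrix = vec.transform(unlabelled_texts)      # reuses it, learns nothing new

# Same number of columns, in the same order, as the training data
print(train_matrix.shape[1] == new_matrix.shape[1])

# Words that weren't seen during fitting simply get no column
print("yacht" in vec.vocabulary_)
```

Words the vectorizer never saw (like "yacht" here) are silently ignored, which is one more reason to hand-classify a decent number of rows.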

## Step 6: Use the classifier to predict the labels of the unclassified violations

You **DON'T NEED A NEW CLASSIFIER**, use the one you have! You'll use `clf.predict`, and feed it... what? What does it need to predict the labels?
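In case the answer isn't obvious: `clf.predict` wants the same kind of thing it was trained on, which is vectorized text. A minimal sketch with invented sentences and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up hand-classified examples: 1 = YES, 0 = NO
labelled_texts = [
    "ordering excessive tests",
    "excessive treatment not warranted",
    "failing to maintain records",
    "practicing while impaired by alcohol",
]
labels = [1, 1, 0, 0]
unlabelled_texts = ["billing for excessive tests", "inadequate patient records"]

vec = CountVectorizer(stop_words="english")
clf = BernoulliNB()
clf.fit(vec.fit_transform(labelled_texts), labels)

# Feed predict the *transformed* unclassified text, not the raw strings
predictions = clf.predict(vec.transform(unlabelled_texts))
print(predictions)
```

You get back one number per unclassified row, in the same order as the rows you passed in.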

### Step 6.2: Those labels are ugly

If you used a `LabelEncoder` to create your categories, you can feed the numbers to `le.inverse_transform` to get actual text back.
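A quick sketch of the round trip, using made-up predictions. `LabelEncoder` numbers the classes in sorted order, so `NO` becomes 0 and `YES` becomes 1. It also can't handle a column that mixes strings with NaN, so fit it on the hand-classified rows only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["YES", "NO", "NO", "YES"])
print(encoded)      # text -> numbers
print(le.classes_)  # position in this array is the number

# Pretend these came out of clf.predict
predicted = np.array([1, 0, 1])
print(le.inverse_transform(predicted))  # numbers -> text
```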

### Step 6.3: Put the category labels back into the original dataframe
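One possible way to do this is with `.loc` and a mask of the rows that were empty. The sketch below uses a toy dataframe and made-up predictions, and the `hand_labelled` column is a hypothetical extra so you can still tell human labels from machine ones afterwards.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "misconduct": ["excessive tests", "fraud", "poor records", "negligence"],
    "category": ["YES", np.nan, "NO", np.nan],
})

# Which rows still need a label?
unlabelled = toy.category.isnull()

# Stand-in for le.inverse_transform(clf.predict(...)): one label per unlabelled row
predicted_labels = np.array(["NO", "YES"])

# Remember which labels came from a human before overwriting anything
toy["hand_labelled"] = ~unlabelled

# Fill in only the empty rows; hand-classified rows are left alone
toy.loc[unlabelled, "category"] = predicted_labels
print(toy)
```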

## Step 7: What actions were taken against those doctors?
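Here's a rough sketch of one way to answer that, on invented toy rows. It assumes that "lost their license" means the `action` text mentions a revocation or a surrender. That definition is a guess, so read through the real `action` values and adjust the pattern before trusting any numbers.

```python
import pandas as pd

# Made-up actions and categories, standing in for the fully labelled dataframe
toy = pd.DataFrame({
    "action": [
        "License surrender.", "Censure and reprimand", "License revoked",
        "License Surrender", "Fine of $5,000", "Probation",
    ],
    "category": ["YES", "YES", "YES", "NO", "NO", "NO"],
})

# Assumption: a revocation or surrender counts as losing the license
toy["lost_license"] = toy.action.str.contains("revo|surrender", case=False, na=False)

# Share of doctors in each category who lost their license
rates = toy.groupby("category").lost_license.mean()
print(rates)

# Raw counts, if you'd rather see numbers than rates
print(pd.crosstab(toy.category, toy.lost_license))
```

Comparing the YES rate against the NO rate is the interesting part: it tells you whether your violation is punished more harshly than everything else.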