# Data Wrangling Template

## Gather

In [1]:
import zipfile
import pandas as pd

From the documentation, we know that `ZipFile` is also a context manager and therefore supports the `with` statement.

In [6]:
# Using `with` closes the archive automatically, so no explicit close() call is needed
with zipfile.ZipFile('armenian-online-job-postings.zip','r') as myzip:
    myzip.extractall()
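As a minimal, self-contained sketch of the same pattern, the snippet below builds a small throwaway archive and then reads it back. The file names `example.zip` and `example.csv` and the temporary directory are assumptions made for illustration; they are not part of the course data.

```python
import os
import tempfile
import zipfile

# Build a tiny throwaway archive so the example does not depend on the course file.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "example.zip")  # hypothetical file name
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("example.csv", "id,title\n0,Chief Financial Officer\n")

# Inspect the archive first, then extract it into a known directory.
# The archive is closed automatically when the with block ends.
with zipfile.ZipFile(archive_path, "r") as archive:
    member_names = archive.namelist()
    archive.extractall(path=workdir)

extracted_path = os.path.join(workdir, "example.csv")
print(member_names)
print(os.path.exists(extracted_path))
```

Calling `namelist()` before `extractall()` is a cheap way to confirm what the archive contains, and passing `path=` keeps the extracted files out of the current working directory.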

[old version of pandas course](https://classroom.udacity.com/courses/ud170/lessons/5430778793/concepts/54059491250923)

In [2]:
# read csv
df = pd.read_csv('online-job-postings.csv')

In [3]:
df.head()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


## Assess

#### Quality

Low quality data is commonly referred to as dirty data. Dirty data has issues with its content.
Common data quality issues include:

1. missing data, like the missing height value for Juan.
2. invalid data, like a cell having an impossible value, e.g., a negative height value for Kwasi. Having "inches" and "centimetres" in the height entries is technically invalid as well, since the datatype for height becomes a string when those are present. The datatype for height should be integer or float.
3. inaccurate data, like Jane actually being 58 inches tall, not 55 inches tall.
4. inconsistent data, like using different units for height (inches and centimetres).
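Some of these issues can be checked programmatically. The sketch below uses a made-up height table whose names follow the list above; the values, the extra row for "Ana", and the column names `value` and `unit` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the names mirror the examples in the list above.
people = pd.DataFrame({
    "name": ["Juan", "Kwasi", "Jane", "Ana"],
    "height": [np.nan, "-60 inches", "55 inches", "150 centimetres"],
})

# Split each entry into a numeric value and a unit.
parts = people["height"].str.extract(r"^(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>\w+)$")
people["value"] = pd.to_numeric(parts["value"])
people["unit"] = parts["unit"]

missing = people["height"].isna()                 # 1. missing data
invalid = people["value"] < 0                     # 2. impossible (negative) values
units_used = sorted(set(people["unit"].dropna()))  # 4. more than one unit means inconsistency

print(people.loc[missing, "name"].tolist())
print(people.loc[invalid, "name"].tolist())
print(units_used)
```

Inaccurate data (item 3) is deliberately absent: a plausible but wrong value such as 55 instead of 58 inches cannot be found from the table alone and needs an external source of truth.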

Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context. Unfortunately, that’s a bit of an evasive definition but it gets to something important: there are no hard and fast rules for data quality. One dataset may be high enough quality for one application but not for another.

#### Tidiness

Untidy data is commonly referred to as "messy" data. Messy data has issues with its structure.

Tidy data is a relatively new concept coined by statistician, professor, and all-round data expert [Hadley Wickham](http://hadley.nz/). 

"It is often said that 80% of data analysis is spent on cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected. To get a handle on the problem, this paper focuses on a small, but important, aspect of data cleaning that I call data tidying: structuring datasets to facilitate analysis."

This [Tidy Data in Python](http://www.jeannicholashould.com/tidy-data-in-python.html) article by Jean-Nicholas Hould is a good starting point.

In [6]:
df.head()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


- Missing values (NaN)
- StartDate inconsistencies

It is best if you document all your assessments at the very bottom of the Assess section in the data wrangling template, i.e., directly above the Clean heading. Referring to them when defining cleaning operations is easier this way and prevents hectic scrolling.
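The two assessments noted so far can also be confirmed programmatically. The following is a rough sketch on a small stand-in frame; the rows and the `StartDate` spellings ("ASAP", "Immediately") are invented for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the postings table.
sample = pd.DataFrame({
    "Title": ["Chief Financial Officer", "Country Coordinator", None],
    "StartDate": ["ASAP", "Immediately", np.nan],
    "Salary": [np.nan, np.nan, "Competitive"],
})

# Missing values per column.
missing_counts = sample.isna().sum()

# Distinct spellings in StartDate; many variants of one meaning suggest inconsistency.
start_variants = sample["StartDate"].value_counts()

print(missing_counts)
print(start_variants)
```

On the real data the same two calls would be `df.isna().sum()` and `df["StartDate"].value_counts()`.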

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
jobpost             19001 non-null object
date                19001 non-null object
Title               18973 non-null object
Company             18994 non-null object
AnnouncementCode    1208 non-null object
Term                7676 non-null object
Eligibility         4930 non-null object
Audience            640 non-null object
StartDate           9675 non-null object
Duration            10798 non-null object
Location            18969 non-null object
JobDescription      15109 non-null object
JobRequirment       16479 non-null object
RequiredQual        18517 non-null object
Salary              9622 non-null object
ApplicationP        18941 non-null object
OpeningDate         18295 non-null object
Deadline            18936 non-null object
Notes               2211 non-null object
AboutC              12470 non-null object
Attach              1559 non-null object
Year              

- Fix nondescriptive column headers (ApplicationP, AboutC, RequiredQual, and JobRequirment)
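One possible way to carry out this fix later, in the Clean step, is `DataFrame.rename`. The sketch uses the column names as they appear in the `df.info()` output; the replacement names are illustrative assumptions, not names prescribed by the dataset.

```python
import pandas as pd

# Empty frame carrying only the four headers in question; the real data is not needed here.
postings = pd.DataFrame(columns=["ApplicationP", "AboutC", "RequiredQual", "JobRequirment"])

# The new names are illustrative assumptions.
new_names = {
    "ApplicationP": "ApplicationProcedure",
    "AboutC": "AboutCompany",
    "RequiredQual": "RequiredQualifications",
    "JobRequirment": "JobRequirements",
}

# rename returns a new frame and leaves the original untouched.
postings_clean = postings.rename(columns=new_names)

print(list(postings_clean.columns))
```

Working on a copy rather than renaming in place keeps the original frame available for comparison while cleaning.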

In [12]:
df.nunique()

jobpost             18892
date                 4391
Title                8636
Company              4554
AnnouncementCode     1014
Term                  411
Eligibility           663
Audience              216
StartDate            1186
Duration             1515
Location              759
JobDescription      12861
JobRequirment       14182
RequiredQual        16688
Salary               2692
ApplicationP        14187
OpeningDate          3344
Deadline             5202
Notes                1031
AboutC               6016
Attach               1495
Year                   12
Month                  12
IT                      2
dtype: int64

### About tidiness
This DataFrame is not tidy: for example, several columns carry date information (such as `date`, `Year`, and `Month`), and company-level data is mixed with posting-level data in the same table.
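A rough sketch of separating company data from posting data is shown below. It reuses three of the dataset's column names, but the rows themselves (including the "Accountant" posting and the company description) are invented for illustration.

```python
import pandas as pd

# Hypothetical rows that reuse a few of the dataset's column names.
postings = pd.DataFrame({
    "Title": ["Software Developer", "Accountant", "BCC Specialist"],
    "Company": ["Yerevan Brandy Company", "Yerevan Brandy Company", "Manoff Group"],
    "AboutC": ["Brandy producer", "Brandy producer", None],
})

# One row per company: company-level facts live here.
companies = (
    postings[["Company", "AboutC"]]
    .drop_duplicates(subset="Company")
    .reset_index(drop=True)
)

# One row per posting: keep only the key that links back to the company table.
posts = postings.drop(columns=["AboutC"])

print(companies)
print(posts)
```

Each observational unit then has its own table, and the `Company` column acts as the key for joining them again when needed.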

You might not use tidy data because some computations are more efficient if the data is in a different format.

Many examples from graphical models, to genomics, to neuroimaging, to social sciences rely on some kind of linear algebra based computations (matrix multiplication, singular value decompositions, eigen decompositions, etc.) which are all optimized to work on matrices, not tidy data frames. 
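To illustrate the point, the sketch below reshapes a small tidy table into a matrix and runs a singular value decomposition on it with NumPy. The table, its column names (`sample`, `feature`, `value`), and its numbers are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical tidy table: one row per (sample, feature) measurement.
tidy = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "feature": ["a", "b", "a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Reshape into a samples-by-features matrix, the layout linear algebra routines expect.
wide = tidy.pivot(index="sample", columns="feature", values="value")
matrix = wide.to_numpy()

# Singular value decomposition on the matrix form.
u, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)

# Multiplying the factors back together recovers the original matrix.
reconstructed = u @ np.diag(singular_values) @ vt
print(matrix.shape)
print(np.allclose(reconstructed, matrix))
```

The decomposition operates on the matrix form directly; the tidy form would have to be pivoted like this every time such a routine is called.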

This approach is again very fast, optimized for the calculations being performed and performs much better than the one-by-one regression approach. But it requires the data in matrix or expression set format. Which brings us to the second general point: **you might not use tidy data because many functions require a different, also very clean and useful data format, and you don’t want to have to constantly be switching back and forth.** Again, this requires you to be more specific to your application, but the potential payoffs can be really big as in the case of limma.
