# List Comprehensions and Lambda Functions

TODO:
* Creating list comprehensions to replace loops with a single line of code.
* Creating single-use functions called lambda functions.

The data set we'll use in this project is in a format called [JavaScript Object Notation](https://www.json.org/) (JSON). As the name indicates, JSON originated from the JavaScript language, but it has since become a language-independent format. From a Python perspective, JSON can be thought of as a collection of Python objects nested inside each other.

![](https://s3.amazonaws.com/dq-content/355/json.svg)

The JSON above is a list, where each element in the list is a dictionary. Each of the dictionaries has the same keys, and one of the values in each dictionary is itself a list.

The Python [`json` module](https://docs.python.org/3.7/library/json.html#module-json) contains a number of functions that make working with JSON objects easier. We can use the [`json.loads()` function](https://docs.python.org/3.7/library/json.html#json.loads) to convert JSON data contained in a string to the equivalent set of Python objects:

# Libraries

In [13]:
import json
import requests
import pandas as pd

In [2]:
world_cup_str = """
[
    {
        "team_1": "France",
        "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
    },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }
]
"""
world_cup_obj = json.loads(world_cup_str)
print(world_cup_str)


[
    {
        "team_1": "France",
        "team_2": "Croatia",
        "game_type": "Final",
        "score" : [4, 2]
    },
    {
        "team_1": "Belgium",
        "team_2": "England",
        "game_type": "3rd/4th Playoff",
        "score" : [2, 0]
    }
]
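To see what `json.loads()` produced, we can inspect the resulting object rather than the original string. The sketch below repeats a trimmed copy of the JSON above so that it runs on its own; `first_game` is a name introduced here purely for illustration.

```python
import json

# A trimmed copy of the World Cup JSON string from above
world_cup_str = """
[
    {"team_1": "France", "team_2": "Croatia", "game_type": "Final", "score": [4, 2]},
    {"team_1": "Belgium", "team_2": "England", "game_type": "3rd/4th Playoff", "score": [2, 0]}
]
"""

world_cup_obj = json.loads(world_cup_str)

# The JSON array becomes a Python list, and each JSON object becomes a dict
print(type(world_cup_obj))      # <class 'list'>
print(type(world_cup_obj[0]))   # <class 'dict'>

# Nested values are reached with ordinary indexing
first_game = world_cup_obj[0]   # hypothetical name, used only in this sketch
print(first_game["team_1"], first_game["score"])  # France [4, 2]
```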



One of the places where the JSON format is commonly used is in the results returned by an [application programming interface](https://en.wikipedia.org/wiki/Application_programming_interface) (API). APIs are interfaces that can be used to send and receive data between different computer systems.

The data set for this project, [`hn_2014.json`](https://dsserver-prod-resources-1.s3.amazonaws.com/355/hn_2014.json?versionId=jhysd7HKgSD8phL3mgFUnGK5GQwCz1eg), can be downloaded from the Hacker News API. It contains data about stories from Hacker News in 2014.

## Reading a JSON File



In [3]:
with open("hn_2014.json") as file:
    hn = json.loads(file.read())
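The `json` module also provides `json.load()`, which reads from an open file object directly, so the `file.read()` step can be skipped. The sketch below writes a small stand-in file (`sample.json` in a temporary directory, an assumption made so the example runs without `hn_2014.json`) and checks that both approaches give the same result.

```python
import json
import os
import tempfile

# Stand-in data written to a temporary file; this is not the real hn_2014.json
sample = [{"title": "Example story", "points": 1}]
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w") as file:
    json.dump(sample, file)

# Approach used above: read the text, then parse the string
with open(path) as file:
    from_string = json.loads(file.read())

# Alternative: let json.load() read from the file object
with open(path) as file:
    from_file = json.load(file)

print(from_string == from_file)  # True
```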

In [4]:
type(hn)

list

Our `hn` variable is a list. Let's find out how many objects are in the list, and the type of the first object (which will almost always be the type of every object in the list in JSON data):

In [5]:
len(hn)

35806

In [6]:
type(hn[0])

dict

Our data set contains `35,806` dictionary objects, each representing a Hacker News story. In order to understand the format of our data set, we'll print the keys of the first dictionary:

In [7]:
hn[0].keys()

dict_keys(['author', 'numComments', 'points', 'url', 'storyText', 'createdAt', 'tags', 'createdAtI', 'title', 'objectId'])

Here is a summary of the keys and the data that they contain:

* `author`: The username of the person who submitted the story.
* `createdAt`: The date and time at which the story was created.
* `createdAtI`: An integer value representing the date and time at which the story was created.
* `numComments`: The number of comments that were made on the story.
* `objectId`: The unique identifier from Hacker News for the story.
* `points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes.
* `storyText`: The text of the story (if the story contains text).
* `tags`: A list of tags associated with the story.
* `title`: The title of the story.
* `url`: The URL that the story links to (if the story links to a URL).
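The integer in `createdAtI` looks like a Unix timestamp, that is, the number of seconds since 1 January 1970 (UTC). That reading is an assumption rather than something the data set documents, but it can be checked against the first story, whose `createdAt` is `'2014-05-29T08:07:50Z'` and whose `createdAtI` is `1401350870`:

```python
from datetime import datetime, timezone

# Values taken from the first story in the data set
created_at = "2014-05-29T08:07:50Z"
created_at_i = 1401350870

# Interpret the integer as seconds since the Unix epoch, in UTC
as_datetime = datetime.fromtimestamp(created_at_i, tz=timezone.utc)
formatted = as_datetime.strftime("%Y-%m-%dT%H:%M:%SZ")

print(formatted)                # 2014-05-29T08:07:50Z
print(formatted == created_at)  # True
```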

# Deleting Dictionary Keys

In [8]:
hn[0]

{'author': 'dragongraphics',
 'numComments': 0,
 'points': 2,
 'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'storyText': '',
 'createdAt': '2014-05-29T08:07:50Z',
 'tags': ['story', 'author_dragongraphics', 'story_7815238'],
 'createdAtI': 1401350870,
 'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
 'objectId': '7815238'}

You may notice that the `createdAt` and `createdAtI` keys hold the same date and time in two different formats. Because the format of `createdAt` is much easier to understand, let's do some data cleaning by deleting the `createdAtI` key from every dictionary.

In [9]:
def del_key(dict_, key):
    """delete a key in a dictionary"""
    # create a copy so we don't
    # modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict
hn_clean = [del_key(dict_, 'createdAtI') for dict_ in hn]
hn_clean[0]

{'author': 'dragongraphics',
 'numComments': 0,
 'points': 2,
 'url': 'http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'storyText': '',
 'createdAt': '2014-05-29T08:07:50Z',
 'tags': ['story', 'author_dragongraphics', 'story_7815238'],
 'title': 'Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability',
 'objectId': '7815238'}
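Because `del_key()` works on a copy, the original dictionary keeps its `createdAtI` key. The sketch below checks this on a single hand-written story, and also shows a dictionary comprehension as one possible alternative; `del_key_comp` is a hypothetical name, not part of the project.

```python
def del_key(dict_, key):
    """Return a copy of dict_ without the given key."""
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict

def del_key_comp(dict_, key):
    """Hypothetical alternative: build the new dict with a comprehension."""
    return {k: v for k, v in dict_.items() if k != key}

# A shortened, hand-written version of the first story
story = {"author": "dragongraphics", "points": 2, "createdAtI": 1401350870}

cleaned = del_key(story, "createdAtI")
print("createdAtI" in cleaned)  # False
print("createdAtI" in story)    # True: the original is left untouched
print(cleaned == del_key_comp(story, "createdAtI"))  # True
```

One difference between the two: `del` raises a `KeyError` if the key is missing, while the comprehension version simply returns an unchanged copy.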

## URLs from Each Story

In [10]:
urls = [dic['url'] for dic in hn_clean]
urls[:5]

['http://ashleynolan.co.uk/blog/are-we-getting-too-sassy',
 'http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot',
 'http://online.wsj.com/articles/apple-to-buy-beats-1401308971',
 'http://alexsblog.org/2014/05/29/dont-wait-for-inspiration/',
 'http://techcrunch.com/2014/05/28/hackerone-get-9m-in-series-a-funding-to-build-bug-tracking-bounty-programs/']
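For comparison, here is a sketch of the loop that the list comprehension replaces. It uses three hand-written dictionaries (the `example.com` URLs are placeholders) rather than `hn_clean`, so it runs on its own.

```python
# Three hand-written stories standing in for hn_clean
stories = [
    {"title": "A", "url": "http://example.com/a"},
    {"title": "B", "url": "http://example.com/b"},
    {"title": "C", "url": "http://example.com/c"},
]

# Loop version: create an empty list, then append inside the loop
urls_loop = []
for story in stories:
    urls_loop.append(story["url"])

# List comprehension version: the same steps in a single line
urls_comp = [story["url"] for story in stories]

print(urls_loop == urls_comp)  # True
```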

## Counting Stories with More Than 1,000 Points

In [11]:
len([dic for dic in hn_clean if dic['points']>1000])

8
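The filter in the comprehension keeps only the dictionaries that pass the `if` test, and `len()` then counts them. A possible alternative, sketched below on made-up data, is to add 1 for each match with a generator expression, which avoids building the intermediate list.

```python
# Hand-written stories; the point values are invented for this sketch
stories = [
    {"title": "A", "points": 1500},
    {"title": "B", "points": 12},
    {"title": "C", "points": 1000},
    {"title": "D", "points": 2300},
]

# Approach used above: build the filtered list, then take its length
count_len = len([s for s in stories if s["points"] > 1000])

# Alternative: add 1 for each match using a generator expression
count_sum = sum(1 for s in stories if s["points"] > 1000)

# Story C has exactly 1000 points, so it is not counted
print(count_len, count_sum)  # 2 2
```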

# The Story with the Greatest Number of Comments

In [14]:
pd.DataFrame(max(hn_clean, key=lambda x: x['numComments']))

| | author | numComments | points | url | storyText | createdAt | tags | title | objectId |
|---|---|---|---|---|---|---|---|---|---|
| 0 | platz | 1208 | 889 | https://blog.mozilla.org/blog/2014/04/03/brend... | | 2014-04-03T19:02:53Z | story | Brendan Eich Steps Down as Mozilla CEO | 7525198 |
| 1 | platz | 1208 | 889 | https://blog.mozilla.org/blog/2014/04/03/brend... | | 2014-04-03T19:02:53Z | author_platz | Brendan Eich Steps Down as Mozilla CEO | 7525198 |
| 2 | platz | 1208 | 889 | https://blog.mozilla.org/blog/2014/04/03/brend... | | 2014-04-03T19:02:53Z | story_7525198 | Brendan Eich Steps Down as Mozilla CEO | 7525198 |
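The story shows up in three rows because `max()` returns a single dictionary, and when `pd.DataFrame()` receives a dictionary that mixes scalar values with one list, it creates one row per list element and repeats the scalars. Wrapping the dictionary in a list is one way to get a single row instead. The sketch below uses a shortened copy of the story; `top_story` is a name introduced for illustration.

```python
import pandas as pd

# Shortened copy of the most-commented story
top_story = {
    "author": "platz",
    "numComments": 1208,
    "tags": ["story", "author_platz", "story_7525198"],
}

# A dict with scalars plus one list: one row per tag, scalars repeated
expanded = pd.DataFrame(top_story)
print(expanded.shape)  # (3, 3)

# A list containing one dict: one row, with the tag list kept in one cell
single_row = pd.DataFrame([top_story])
print(single_row.shape)  # (1, 3)
```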


# Lambda Functions

![](https://s3.amazonaws.com/dq-content/355/lambda_1_anim.svg)

In [15]:
multiply = lambda a, b: a*b
multiply(100,5)

500
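A lambda is a compact way of writing a function whose body is a single expression. The sketch below places the lambda next to an equivalent `def` version; `multiply_def` is a hypothetical name used only for the comparison.

```python
# A regular function defined with def
def multiply_def(a, b):
    return a * b

# The same logic as a lambda: parameters before the colon, return value after
multiply = lambda a, b: a * b

print(multiply_def(100, 5))  # 500
print(multiply(100, 5))      # 500
```

Assigning a lambda to a name, as here, is mainly useful for demonstration; lambdas are more commonly passed directly as an argument, such as the `key` of `max()` or `sorted()`.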

# Posts with the Most Points in 2014

In [16]:
hn_sorted_points = sorted(hn_clean, key=lambda x: x['points'], reverse=True)
[n['title'] for n in hn_sorted_points[:5]]

['2048',
 'Today is The Day We Fight Back',
 'Wozniak: “Actually, the movie was largely a lie about me”',
 'Microsoft Open Sources C# Compiler',
 'Elon Musk: To the People of New Jersey']
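The `key` argument tells `sorted()` what value to compare for each dictionary. As a sketch on made-up data, the example below shows the lambda approach alongside `operator.itemgetter` from the standard library, which builds an equivalent key function.

```python
from operator import itemgetter

# Hand-written stories; the point values are invented for this sketch
stories = [
    {"title": "A", "points": 40},
    {"title": "B", "points": 900},
    {"title": "C", "points": 310},
]

# Approach used above: a lambda that returns the value to sort by
by_lambda = sorted(stories, key=lambda x: x["points"], reverse=True)

# Alternative: itemgetter("points") fetches the same value from each dict
by_itemgetter = sorted(stories, key=itemgetter("points"), reverse=True)

print([s["title"] for s in by_lambda])  # ['B', 'C', 'A']
print(by_lambda == by_itemgetter)       # True
```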

# Reading JSON file into Pandas

In [17]:
hn_df = pd.DataFrame(hn_clean)
hn_df.head()

| | author | numComments | points | url | storyText | createdAt | tags | title | objectId |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dragongraphics | 0 | 2 | http://ashleynolan.co.uk/blog/are-we-getting-t... | | 2014-05-29T08:07:50Z | [story, author_dragongraphics, story_7815238] | Are we getting too Sassy? Weighing up micro-op... | 7815238 |
| 1 | jcr | 0 | 1 | http://spectrum.ieee.org/automaton/robotics/ho... | | 2014-05-29T08:05:58Z | [story, author_jcr, story_7815234] | Telemba Turns Your Old Roomba and Tablet Into ... | 7815234 |
| 2 | callum85 | 0 | 1 | http://online.wsj.com/articles/apple-to-buy-be... | | 2014-05-29T08:05:06Z | [story, author_callum85, story_7815230] | Apple Agrees to Buy Beats for $3 Billion | 7815230 |
| 3 | d3v3r0 | 0 | 1 | http://alexsblog.org/2014/05/29/dont-wait-for-... | | 2014-05-29T08:00:08Z | [story, author_d3v3r0, story_7815222] | Don’t wait for inspiration | 7815222 |
| 4 | timmipetit | 0 | 1 | http://techcrunch.com/2014/05/28/hackerone-get... | | 2014-05-29T07:46:19Z | [story, author_timmipetit, story_7815191] | HackerOne Get $9M In Series A Funding To Build... | 7815191 |


# Exploring Tags Using the Apply Function

The `tags` column holds, for each story, the list of tags from our original JSON.

At first glance, it looks like each of these lists contains three items:

* The string `story`
* The name of the author
* The story ID

If that's the case, then the column doesn't contain any unique data, and we can remove it. We're going to analyze this column to make sure that's the case.
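One way to test that assumption is to measure the length of every list and tally the results. The sketch below does this with `Series.apply(len)` followed by `value_counts()` on a small hand-written Series that stands in for `hn_df.tags`; the tag values are invented.

```python
import pandas as pd

# Hand-written stand-in for hn_df.tags
tags = pd.Series([
    ["story", "author_a", "story_1"],
    ["story", "author_b", "story_2", "ask_hn"],
    ["story", "author_c", "story_3"],
])

# apply(len) runs len() on each list; value_counts() tallies the results
tag_lengths = tags.apply(len)
print(tag_lengths.tolist())                  # [3, 4, 3]
print(tag_lengths.value_counts().to_dict())  # {3: 2, 4: 1}
```

Any length other than 3 in the tally means some lists carry extra information worth a closer look.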

## Looking at the Lists That Have Four Items

In [20]:
four_tags = hn_df.tags[hn_df.tags.apply(len)==4]
four_tags.head()

43     [story, author_alamgir_mand, story_7813869, sh...
86       [story, author_cweagans, story_7812404, ask_hn]
104    [story, author_nightstrike789, story_7812099, ...
107    [story, author_ISeemToBeAVerb, story_7812048, ...
109       [story, author_Swizec, story_7812018, show_hn]
Name: tags, dtype: object

It looks like whenever there are four tags, the extra tag is the last of the four.

In [22]:
hn_df['tags'] = hn_df.tags.apply(lambda tag: tag[-1] if len(tag)==4 else None)
hn_df.tags.head()

0    None
1    None
2    None
3    None
4    None
Name: tags, dtype: object

In [23]:
hn_df.tags.value_counts()

ask_hn     1348
show_hn     999
Name: tags, dtype: int64
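The lambda passed to `apply()` uses a conditional expression, `value_if_true if condition else value_if_false`, which works equally well inside a list comprehension. The sketch below reproduces the same logic in plain Python on invented tag lists, with `collections.Counter` standing in for `value_counts()`.

```python
from collections import Counter

# Hand-written stand-in for the lists in the tags column
tag_lists = [
    ["story", "author_a", "story_1"],
    ["story", "author_b", "story_2", "ask_hn"],
    ["story", "author_c", "story_3", "show_hn"],
    ["story", "author_d", "story_4", "ask_hn"],
]

# Same rule as the lambda above, written as a conditional expression
extra_tags = [tags[-1] if len(tags) == 4 else None for tags in tag_lists]
print(extra_tags)  # [None, 'ask_hn', 'show_hn', 'ask_hn']

# Count only the non-missing tags, as value_counts() does by default
counts = Counter(tag for tag in extra_tags if tag is not None)
print(counts["ask_hn"], counts["show_hn"])  # 2 1
```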