# Chapter 3. Mining LinkedIn: Faceting Job Titles, Clustering Colleagues

* Babelpish / Fundamentals of Natural Language Processing with Python: Part 3 - 'Mining the Social Web' [1]
* 김무성

# Contents

* 3.1 Overview
* 3.2 Exploring the LinkedIn API
* 3.3 Crash Course on Clustering Data
* 3.4 Summary

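# 3.1 Overview

* LinkedIn - https://www.linkedin.com/
* LinkedIn's data is inherently quite different from that of other social networks
    - 'It could be likened to a private event with a semi-formal dress code'
* This chapter introduces a few basic data mining techniques that help answer questions such as:
    - Which of your connections are most similar by a criterion such as job title?
    - Which of your connections have worked at a company where you would like to work?
    - Where do most of your connections live geographically?
* What you will learn in this chapter
    - LinkedIn's developer platform and making API requests
    - Three common types of clustering, a fundamental machine learning topic that applies to nearly every problem domain
    - Data cleansing and normalization
    - Geocoding, a means of arriving at a set of coordinates from a textual reference to a place
    - Visualizing geographic data with Google Earth and with cartograms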
    

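# 3.2 Exploring the LinkedIn API

* 3.2.1 Making LinkedIn API Requests
* 3.2.2 Downloading Your LinkedIn Connections as a CSV File

## 3.2.1 Making LinkedIn API Requests

#### Steps for obtaining LinkedIn API access [3,4,5]
* 1) Create an Application on the LinkedIn developer platform
* 2) Obtain the Client ID and Client Secret
* 3) Request authorization
* 4) Obtain the authorization_code
* 5) Obtain the access_token
* 6) Test the API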

#### Reference: Using the LinkedIn API with OAuth 2.0
* https://developer.linkedin.com/docs/oauth2
* http://www.slideshare.net/KamyarMohager/o-auth-2-and-linked-inpdf

#### 1) Create an Application on the LinkedIn developer platform

*  https://www.linkedin.com/secure/developer.



<img src="resources/images/LinkedIn-app.1.png" width="600px">

<img src="resources/images/LinkedIn-app.2.png" width="600px">

#### 2) Obtain the Client ID and Client Secret

In [1]:
D_CLIENT_ID = '754srtu0p03l9l'
D_CLIENT_SECRET = 'tQ2fFx1XOjuE9FjM'

<img src="resources/images/LinkedIn-app.3.png" width="600px">

#### 3) Request authorization

##### Authorization request steps
* a) Register the Redirect URLs
* b) Call the authorization request API in the required format
    - API : https://www.linkedin.com/uas/oauth2/authorization
    - Parameters
        - response_type
        - client_id
        - redirect_uri
        - state
        - scope

##### a) Register the Redirect URLs
* Any page will do, as long as it is actually reachable on the web.

In [2]:
D_REDIRECT_URL = 'http://babelpish.github.io/'

<img src="resources/images/LinkedIn-app.4.png" width="600px">

<img src="resources/images/LinkedIn-app.5.png" width="600px">

##### b) Call the authorization request API in the required format

 * This is a GET request, so the simplest way to issue it is from the browser's address bar.

In [4]:
AUTH_API = 'https://www.linkedin.com/uas/oauth2/authorization'
STATE_STR = '987654321'
SCOPE_STR = 'r_basicprofile'

AUTH_URL = AUTH_API + '?' \
                                + 'response_type=' + 'code' + '&' \
                                + 'client_id=' + D_CLIENT_ID + '&' \
                                + 'redirect_uri=' + D_REDIRECT_URL + '&' \
                                + 'state=' + STATE_STR + '&' \
                                + 'scope=' + SCOPE_STR 
                    
print AUTH_URL

https://www.linkedin.com/uas/oauth2/authorization?response_type=code&client_id=754srtu0p03l9l&redirect_uri=http://babelpish.github.io/&state=987654321&scope=r_basicprofile
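
The string concatenation above leaves `redirect_uri` unescaped. That happens to work for this URL, but it can break once a value contains characters such as `&` or `?`. The following is a minimal sketch of the same request built with the standard library's `urlencode`; the helper name `build_auth_url` and the `YOUR_CLIENT_ID` placeholder are illustrative assumptions, not part of the original notebook.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

AUTH_API = 'https://www.linkedin.com/uas/oauth2/authorization'

def build_auth_url(client_id, redirect_uri, state, scope='r_basicprofile'):
    """Return the authorization URL with every parameter percent-encoded."""
    params = [
        ('response_type', 'code'),
        ('client_id', client_id),
        ('redirect_uri', redirect_uri),
        ('state', state),
        ('scope', scope),
    ]
    return AUTH_API + '?' + urlencode(params)

# YOUR_CLIENT_ID is a placeholder; substitute the value from your application page.
auth_url = build_auth_url('YOUR_CLIENT_ID', 'http://babelpish.github.io/', '987654321')
print(auth_url)
```

A list of pairs is used instead of a dict so that the parameters keep the same order as in the hand-built URL.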


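#### 4) Obtain the authorization_code

* Paste the URL above into a web browser.
* A screen like the one below then asks you to authorize the application with your LinkedIn account.
* Enter your ID and password to authorize it.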

<img src="resources/images/LinkedIn-app.6.png" >

* Once you allow access
    - the browser moves to the redirect page we registered, and
    - the address in the browser's address bar has changed.
        - http://babelpish.github.io/?code=AQRhuylC_US9QtgttkPNNOwpfY9pyfvZAZ6Bs_9EKoA7qLAIQspU6jQwD9a_S--XH4tpsif_WnBN_42T-IaWflYvWgxDTl3mIO81E20zBqGfY2Mx9dY&state=987654321
    - The value of the `code` parameter in that address is the authorization_code.
        -  code=AQRhuylC_US9QtgttkPNNOwpfY9pyfvZAZ6Bs_9EKoA7qLAIQspU6jQwD9a_S--XH4tpsif_WnBN_42T-IaWflYvWgxDTl3mIO81E20zBqGfY2Mx9dY
        - authorization_code 
            -  AQRhuylC_US9QtgttkPNNOwpfY9pyfvZAZ6Bs_9EKoA7qLAIQspU6jQwD9a_S--XH4tpsif_WnBN_42T-IaWflYvWgxDTl3mIO81E20zBqGfY2Mx9dY  
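
Copying the code out of the address bar by hand is error-prone, and the `state` value that comes back should match the one that was sent. The following is a small sketch, under the assumption that you paste the redirected address into a string; the helper name `extract_authorization_code` and the `EXAMPLE_CODE` value are made up for illustration.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

def extract_authorization_code(redirected_url, expected_state):
    """Pull `code` out of the redirect URL after checking that `state` matches."""
    query = parse_qs(urlparse(redirected_url).query)
    if query.get('state', [None])[0] != expected_state:
        raise ValueError('state mismatch; do not trust this redirect')
    return query['code'][0]

# EXAMPLE_CODE is a placeholder, not a real authorization code.
redirected = 'http://babelpish.github.io/?code=EXAMPLE_CODE&state=987654321'
code = extract_authorization_code(redirected, '987654321')
print(code)
```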

In [5]:
D_AUTHORIZATION_CODE =  'AQRhuylC_US9QtgttkPNNOwpfY9pyfvZAZ6Bs_9EKoA7qLAIQspU6jQwD9a_S--XH4tpsif_WnBN_42T-IaWflYvWgxDTl3mIO81E20zBqGfY2Mx9dY'
print D_AUTHORIZATION_CODE

AQRhuylC_US9QtgttkPNNOwpfY9pyfvZAZ6Bs_9EKoA7qLAIQspU6jQwD9a_S--XH4tpsif_WnBN_42T-IaWflYvWgxDTl3mIO81E20zBqGfY2Mx9dY


##### Tip: the Advanced REST client makes this easier.
* https://chrome.google.com/webstore/detail/advanced-rest-client/hgmloofddffdnphfgcellkdfbfbjeloo

<img src="resources/images/LinkedIn-app.7.png" width="600px">

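#### 5) Obtain the access_token

* Having come this far, you are now entitled to obtain the access_token that every other API call requires.
* Call the access_token API
    - API - https://www.linkedin.com/uas/oauth2/accessToken
    - Parameters
        - grant_type
        - code
        - redirect_uri
        - client_id
        - client_secret
* However, the token must be requested within 20 seconds of obtaining the authorization_code above, so call the API quickly.
* The documentation says to use POST, but GET also works, so you can complete the API URL and paste it into the browser's address bar right away.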
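
Since the documentation asks for POST, here is a hedged sketch of how the same parameters could be packaged as a form-encoded POST body with the standard library. It only builds the request object and does not send it; `build_token_request` and the `EXAMPLE_CODE`, `YOUR_CLIENT_ID`, and `YOUR_CLIENT_SECRET` values are placeholders introduced for illustration.

```python
try:
    from urllib.parse import urlencode          # Python 3
    from urllib.request import Request
except ImportError:
    from urllib import urlencode                # Python 2
    from urllib2 import Request

ATOKEN_API = 'https://www.linkedin.com/uas/oauth2/accessToken'

def build_token_request(code, redirect_uri, client_id, client_secret):
    """Build (but do not send) the POST request that trades a code for a token."""
    body = urlencode([
        ('grant_type', 'authorization_code'),
        ('code', code),
        ('redirect_uri', redirect_uri),
        ('client_id', client_id),
        ('client_secret', client_secret),
    ]).encode('ascii')
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    # Supplying `data` is what makes this a POST rather than a GET.
    return Request(ATOKEN_API, data=body, headers=headers)

req = build_token_request('EXAMPLE_CODE', 'http://babelpish.github.io/',
                          'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET')
print(req.get_method())
```

Passing `req` to `urlopen` would actually send it. One reason to prefer the POST form is that the client secret stays out of the browser's address bar and history.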

In [7]:
ATOKEN_API = 'https://www.linkedin.com/uas/oauth2/accessToken'

ATOKEN_URL = ATOKEN_API + '?' \
                                + 'grant_type=' + 'authorization_code' + '&' \
                                + 'code=' + D_AUTHORIZATION_CODE + '&' \
                                + 'redirect_uri=' + D_REDIRECT_URL + '&' \
                                + 'client_id=' + D_CLIENT_ID + '&' \
                                + 'client_secret=' + D_CLIENT_SECRET
                    
print ATOKEN_URL

https://www.linkedin.com/uas/oauth2/accessToken?grant_type=authorization_code&code=AQRhuylC_US9QtgttkPNNOwpfY9pyfvZAZ6Bs_9EKoA7qLAIQspU6jQwD9a_S--XH4tpsif_WnBN_42T-IaWflYvWgxDTl3mIO81E20zBqGfY2Mx9dY&redirect_uri=http://babelpish.github.io/&client_id=754srtu0p03l9l&client_secret=tQ2fFx1XOjuE9FjM


* On success, the browser displays a result like the following

{"access_token":"AQXfELi0qc3wEVDshes3VEaAi2y6aC_e88W6paqVQX4en53OVEOaEzuhhHzJnLNu-EW0bfo9SfcrshPYX0pJuz3uUbBQRvE6q4AJqTjgQchg-vEtslBBgwH5AUBcV1goGu_G4AcM5WAIvkI0vdyCKYxL1bEh1FZ3cLt-PMQLdzMae7XsRkM","expires_in":5183999}

In [8]:
D_ACCESS_TOKEN = 'AQXfELi0qc3wEVDshes3VEaAi2y6aC_e88W6paqVQX4en53OVEOaEzuhhHzJnLNu-EW0bfo9SfcrshPYX0pJuz3uUbBQRvE6q4AJqTjgQchg-vEtslBBgwH5AUBcV1goGu_G4AcM5WAIvkI0vdyCKYxL1bEh1FZ3cLt-PMQLdzMae7XsRkM'
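
Rather than copying the token out of the browser by hand, the JSON response can be parsed directly. In OAuth 2.0 the `expires_in` field is a lifetime in seconds, so the value shown above corresponds to just under 60 days. The following is a minimal sketch; `parse_token_response` and the `EXAMPLE_TOKEN` value are illustrative placeholders.

```python
import json

# SAMPLE_RESPONSE mimics the shape of the response above; the token is a placeholder.
SAMPLE_RESPONSE = '{"access_token": "EXAMPLE_TOKEN", "expires_in": 5183999}'

def parse_token_response(text):
    """Return (access_token, lifetime_in_days) from the JSON response body."""
    payload = json.loads(text)
    days = payload['expires_in'] / 86400.0  # 86400 seconds per day
    return payload['access_token'], days

token, days = parse_token_response(SAMPLE_RESPONSE)
print(token)
print(round(days, 1))
```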

#### 6) Test the API

In [9]:
PEOPLE_API = 'https://api.linkedin.com/v1/people/~'

PEOPLE_URL = PEOPLE_API + '?' \
                                + 'oauth2_access_token=' + D_ACCESS_TOKEN
                    
print PEOPLE_URL

https://api.linkedin.com/v1/people/~?oauth2_access_token=AQXfELi0qc3wEVDshes3VEaAi2y6aC_e88W6paqVQX4en53OVEOaEzuhhHzJnLNu-EW0bfo9SfcrshPYX0pJuz3uUbBQRvE6q4AJqTjgQchg-vEtslBBgwH5AUBcV1goGu_G4AcM5WAIvkI0vdyCKYxL1bEh1FZ3cLt-PMQLdzMae7XsRkM


* To send the token in a header instead, use the following

Authorization:  Bearer AQXfELi0qc3wEVDshes3VEaAi2y6aC_e88W6paqVQX4en53OVEOaEzuhhHzJnLNu-EW0bfo9SfcrshPYX0pJuz3uUbBQRvE6q4AJqTjgQchg-vEtslBBgwH5AUBcV1goGu_G4AcM5WAIvkI0vdyCKYxL1bEh1FZ3cLt-PMQLdzMae7XsRkM
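
The header form can be sketched with the standard library as follows. The request is only constructed, not sent; `build_profile_request` and the `EXAMPLE_TOKEN` value are assumptions made for this illustration.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request         # Python 2

PEOPLE_API = 'https://api.linkedin.com/v1/people/~'

def build_profile_request(access_token):
    """Build (but do not send) a profile request carrying a Bearer token header."""
    req = Request(PEOPLE_API)
    req.add_header('Authorization', 'Bearer ' + access_token)
    return req

req = build_profile_request('EXAMPLE_TOKEN')  # placeholder token
print(req.get_header('Authorization'))
```

With this form the token no longer appears in the URL itself.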


* Calls like this are also a bit easier with the Advanced REST Client.

<img src="resources/images/LinkedIn-app.8.png" >

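#### Example 3.1. Using LinkedIn OAuth credentials to receive an access token suitable for development and data access

* python-linkedin makes obtaining an access token more convenient.
* http://ozgur.github.io/python-linkedin/
* The book's example uses OAuth 1.0, so the code here was rewritten with reference to the package's official site.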

In [10]:
# !pip install python-linkedin

In [11]:
from linkedin import linkedin


API_KEY = D_CLIENT_ID
API_SECRET = D_CLIENT_SECRET
RETURN_URL = D_REDIRECT_URL

In [25]:
 linkedin.PERMISSIONS.enums.values()

['r_basicprofile',
 'rw_nus',
 'r_network',
 'r_contactinfo',
 'w_messages',
 'rw_groups',
 'r_emailaddress',
 'r_fullprofile']

In [37]:
perms = ['r_basicprofile',
         'r_emailaddress',
         'rw_company_admin',
         'w_share'
        ]

In [38]:
authentication = linkedin.LinkedInAuthentication(API_KEY, API_SECRET, RETURN_URL, perms) #[linkedin.PERMISSIONS.enums.values()[0]])
print authentication.authorization_url  # open this url on your browser

https://www.linkedin.com/uas/oauth2/authorization?scope=r_basicprofile%20r_emailaddress%20rw_company_admin%20w_share&state=e94ce89dc1b89a4eb207aa27e3ae534e&redirect_uri=http%3A//babelpish.github.io/&response_type=code&client_id=754srtu0p03l9l


In [39]:
application = linkedin.LinkedInApplication(authentication)

In [40]:
authentication.authorization_code = 'AQQKzOZKwSRSHPZiC65WlhEzvQpE19AG8baucXmd5RQfr1EzfRsS_KBjJz8-L_2jmvkwPiYzqOpytd573aVM4cDG0Wk15df0Alwayzz6i-S2tOaEXuc'
authentication.get_access_token()



AccessToken(access_token=u'AQUvfXDqVybCF2xM4YT6w-lBN5Y61aAdOgAceDJOXIw_G_z0PT4af3zHuvH5tGW9WaFbcWzj39TXCKkpEmRQySgxLzUkRK6LHP7aox3amOwvVlITkORwvUW96B0cD4kr57pQWdstLw_EiHS6CUDzPMnxCZ4L7hv-0-XeWAbWj5K_KmAqK-c', expires_in=5183999)

In [41]:
application.get_profile()



{u'firstName': u'\ubb34\uc131',
 u'headline': u'\ucf54\ub09c\ud14c\ud06c\ub180\ub85c\uc9c0 \uc5f0\uad6c\uc6d0',
 u'id': u'IKxFbKel6-',
 u'lastName': u'\uae40',
 u'siteStandardProfileRequest': {u'url': u'https://www.linkedin.com/profile/view?id=151455949&authType=name&authToken=7gdB&trk=api*a4598841*s4915561*'}}

## Example 3-2. Retrieving your LinkedIn connections and storing them to disk

In [42]:
application.get_connections()



LinkedInError: 403 Client Error: Forbidden: Unknown Error

In [22]:
import json

connections = application.get_connections()

# Use the same path that the reload cell and the later examples read from
connections_data = 'resources/ch03-linkedin/linkedin_connections.json'

with open(connections_data, 'w') as f:
    f.write(json.dumps(connections, indent=1))



LinkedInError: 403 Client Error: Forbidden: Unknown Error

In [None]:
# Execute this cell if you need to reload data...
import json
connections = json.loads(open('resources/ch03-linkedin/linkedin_connections.json').read())

**Note**: Should you need to revoke account access from your application or any other OAuth application, you can do so at [https://www.linkedin.com/secure/settings?userAgree=&goback=%2Enas_*1_*1_*1](https://www.linkedin.com/secure/settings?userAgree=&goback=%2Enas_*1_*1_*1)

## Example 3. Pretty-printing your LinkedIn connections' data

In [None]:
from prettytable import PrettyTable # pip install prettytable

pt = PrettyTable(field_names=['Name', 'Location'])
pt.align = 'l'

[ pt.add_row((c['firstName'] + ' ' + c['lastName'], c['location']['name'])) 
  for c in connections['values']
      if c.has_key('location')]

print pt

## Example 4. Displaying job position history for your profile and a connection's profile

In [None]:
import json

# See http://developer.linkedin.com/documents/profile-fields#fullprofile
# for details on additional field selectors that can be passed in for
# retrieving additional profile information.

# Display your own positions...

# `application` is the LinkedInApplication instance created earlier
my_positions = application.get_profile(selectors=['positions'])
print json.dumps(my_positions, indent=1)

# Display positions for someone in your network...

# Get an id for a connection. We'll just pick the first one.
connection_id = connections['values'][0]['id']
connection_positions = application.get_profile(member_id=connection_id, 
                                               selectors=['positions'])
print json.dumps(connection_positions, indent=1)

## Example 5. Using field selector syntax to request additional details for APIs

In [None]:
# See http://developer.linkedin.com/documents/understanding-field-selectors
# for more information on the field selector syntax

my_positions = application.get_profile(selectors=['positions:(company:(name,industry,id))'])
print json.dumps(my_positions, indent=1)

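## 3.2.2 Downloading Your LinkedIn Connections as a CSV File

# 3.3 Crash Course on Clustering Data

* 3.3.1 Clustering Enhances User Experiences
* 3.3.2 Normalizing Data to Enable Analysis
    - 3.3.2.1 Normalizing and counting companies
    - 3.3.2.2 Normalizing and counting job titles
    - 3.3.2.3 Normalizing and counting locations
    - 3.3.2.4 Visualizing coordinates with a cartogram
* 3.3.3 Measuring Similarity
* 3.3.4 Clustering Algorithms
    - 3.3.4.1 Greedy clustering
    - 3.3.4.2 Hierarchical clustering
    - 3.3.4.3 k-means clustering
    - 3.3.4.4 Visualizing geographic clusters with Google Earth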

## Example 6. Simple normalization of company suffixes from address book data

In [None]:
import os
import csv
from collections import Counter
from operator import itemgetter
from prettytable import PrettyTable

# XXX: Place your "Outlook CSV" formatted file of connections from 
# http://www.linkedin.com/people/export-settings at the following
# location: resources/ch03-linkedin/my_connections.csv

CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv')

# Define a set of transforms that converts the first item
# to the second item. Here, we're simply handling some
# commonly known abbreviations, stripping off common suffixes, 
# etc.

transforms = [(', Inc.', ''), (', Inc', ''), (', LLC', ''), (', LLP', ''),
               (' LLC', ''), (' Inc.', ''), (' Inc', '')]

csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='"')
contacts = [row for row in csvReader]
companies = [c['Company'].strip() for c in contacts if c['Company'].strip() != '']

for i, _ in enumerate(companies):
    for transform in transforms:
        companies[i] = companies[i].replace(*transform)

pt = PrettyTable(field_names=['Company', 'Freq'])
pt.align = 'l'
c = Counter(companies)
[pt.add_row([company, freq]) 
 for (company, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) 
     if freq > 1]
print pt
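
Because `str.replace` matches anywhere in the string, a transform such as `(' Inc', '')` can also cut into the middle of a name. The following is a hedged sketch of a suffix-only variant that runs without the CSV export; the `normalize_company` helper and the company names are made up for illustration.

```python
from collections import Counter

# Comma forms come first so that 'Acme, Inc.' does not end up as 'Acme,'.
SUFFIXES = [', Inc.', ', Inc', ', LLC', ', LLP', ' LLC', ' Inc.', ' Inc']

def normalize_company(name):
    """Strip at most one trailing legal suffix, leaving the rest of the name intact."""
    name = name.strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[:-len(suffix)].strip()
    return name

# Made-up rows standing in for the 'Company' column of the CSV export.
sample = ['Acme, Inc.', 'Acme Inc', 'Acme', 'Globex LLC', 'Globex, LLC',
          'Blue Incline Labs']
counts = Counter(normalize_company(c) for c in sample)
print(counts.most_common())
```

With plain `replace`, the last sample name would be rewritten to `Blueline Labs`; matching only at the end of the string avoids that.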

## Example 7. Standardizing common job titles and computing their frequencies

In [None]:
import os
import csv
from operator import itemgetter
from collections import Counter
from prettytable import PrettyTable

# XXX: Place your "Outlook CSV" formatted file of connections from 
# http://www.linkedin.com/people/export-settings at the following
# location: resources/ch03-linkedin/my_connections.csv

CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv')

transforms = [
    ('Sr.', 'Senior'),
    ('Sr', 'Senior'),
    ('Jr.', 'Junior'),
    ('Jr', 'Junior'),
    ('CEO', 'Chief Executive Officer'),
    ('COO', 'Chief Operating Officer'),
    ('CTO', 'Chief Technology Officer'),
    ('CFO', 'Chief Finance Officer'),
    ('VP', 'Vice President'),
    ]

csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='"')
contacts = [row for row in csvReader]

# Read in a list of titles and split apart
# any combined titles like "President/CEO."
# Other variations could be handled as well, such
# as "President & CEO", "President and CEO", etc.

titles = []
for contact in contacts:
    titles.extend([t.strip() for t in contact['Job Title'].split('/')
                  if contact['Job Title'].strip() != ''])

# Replace common/known abbreviations

for i, _ in enumerate(titles):
    for transform in transforms:
        titles[i] = titles[i].replace(*transform)

# Print out a table of titles sorted by frequency

pt = PrettyTable(field_names=['Title', 'Freq'])
pt.align = 'l'
c = Counter(titles)
[pt.add_row([title, freq]) 
 for (title, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) 
     if freq > 1]
print pt

# Print out a table of tokens sorted by frequency

tokens = []
for title in titles:
    tokens.extend([t.strip(',') for t in title.split()])
pt = PrettyTable(field_names=['Token', 'Freq'])
pt.align = 'l'
c = Counter(tokens)
[pt.add_row([token, freq]) 
 for (token, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) 
     if freq > 1 and len(token) > 2]
print pt
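
The comment in the example mentions variations such as "President & CEO" and "President and CEO". One way to handle them, sketched here under the assumption that regular expressions are acceptable, is to split on all separators at once and to expand abbreviations only when they appear as whole words. `normalize_title` and the shortened abbreviation table are illustrative, not the book's code.

```python
import re

# Split on '/', '&', or the whole word 'and' (so 'Brand' is left alone).
SEPARATOR = re.compile(r'\s*(?:/|&|\band\b)\s*')

ABBREVIATIONS = {'Sr': 'Senior', 'Jr': 'Junior', 'CEO': 'Chief Executive Officer',
                 'CTO': 'Chief Technology Officer', 'VP': 'Vice President'}
# \b keeps 'VP' from matching inside 'SVP'; the optional '.' absorbs 'Sr.'
ABBREV = re.compile(r'\b(' + '|'.join(ABBREVIATIONS) + r')\b\.?')

def normalize_title(raw):
    """Split a combined title and expand known abbreviations in each part."""
    parts = [p for p in SEPARATOR.split(raw.strip()) if p]
    return [ABBREV.sub(lambda m: ABBREVIATIONS[m.group(1)], p) for p in parts]

print(normalize_title('President/CEO'))
print(normalize_title('Sr. Brand Manager and VP'))
print(normalize_title('SVP & CTO'))
```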

## Example 8. Geocoding locations with Microsoft Bing

In [None]:
from geopy import geocoders

GEO_APP_KEY = '' # XXX: Get this from https://www.bingmapsportal.com
g = geocoders.Bing(GEO_APP_KEY)
print g.geocode("Nashville", exactly_one=False)

## Example 9. Geocoding locations of LinkedIn connections with Microsoft Bing

In [None]:
from geopy import geocoders

GEO_APP_KEY = '' # XXX: Get this from https://www.bingmapsportal.com
g = geocoders.Bing(GEO_APP_KEY)

transforms = [('Greater ', ''), (' Area', '')]

results = {}
for c in connections['values']:
    if not c.has_key('location'): continue
        
    transformed_location = c['location']['name']
    for transform in transforms:
        transformed_location = transformed_location.replace(*transform)
    geo = g.geocode(transformed_location, exactly_one=False)
    if geo == []: continue
    results.update({ c['location']['name'] : geo })
    
print json.dumps(results, indent=1)
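
Many connections share the same location, so the loop above may send the same query to the geocoder repeatedly. The following sketch geocodes each distinct normalized name only once. The `geocode` argument stands for any callable that takes a place name, and `fake_geocode` with its zero coordinates is an invented stand-in so that the example runs without an API key.

```python
def normalize_location(name, transforms=(('Greater ', ''), (' Area', ''))):
    """Apply each (old, new) replacement in turn and trim whitespace."""
    for old, new in transforms:
        name = name.replace(old, new)
    return name.strip()

def geocode_unique(location_names, geocode):
    """Call `geocode` once per distinct normalized name and reuse the answer."""
    cache, results = {}, {}
    for original in location_names:
        key = normalize_location(original)
        if key not in cache:
            cache[key] = geocode(key)
        if cache[key]:  # skip names the geocoder could not resolve
            results[original] = cache[key]
    return results

# fake_geocode is a stand-in for a real geocoder; the coordinates are placeholders.
calls = []
def fake_geocode(name):
    calls.append(name)
    return [(name, (0.0, 0.0))] if name != 'Nowhere' else []

names = ['Greater Nashville Area', 'Nashville', 'Greater Nashville Area', 'Nowhere']
results = geocode_unique(names, fake_geocode)
print(sorted(results))
print(calls)
```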

## Example 10. Parsing out states from Bing geocoder results using a regular expression

In [None]:
import re

# Most results contain a response that can be parsed by
# picking out two consecutive upper case letters as a clue
# for the state. Note that the leading greedy '.*' makes the
# pattern capture the last such pair in the string.
pattern = re.compile('.*([A-Z]{2}).*')
    
def parseStateFromBingResult(r):
    result = pattern.search(r[0][0])
    if result == None: 
        print "Unresolved match:", r
        return "???"
    elif len(result.groups()) == 1:
        print result.groups()
        return result.groups()[0]
    else:
        print "Unresolved match:", result.groups()
        return "???"

    
transforms = [('Greater ', ''), (' Area', '')]

results = {}
for c in connections['values']:
    if not c.has_key('location'): continue
    if not c['location']['country']['code'] == 'us': continue
        
    transformed_location = c['location']['name']
    for transform in transforms:
        transformed_location = transformed_location.replace(*transform)
    
    geo = g.geocode(transformed_location, exactly_one=False)
    if geo == []: continue
    parsed_state = parseStateFromBingResult(geo)
    if parsed_state != "???":
        results.update({c['location']['name'] : parsed_state})
    
print json.dumps(results, indent=1)
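
The behaviour of the state pattern depends on how greedy matching interacts with the address text. The short experiment below compares it with a word-bounded alternative; `BOUNDED` is a hypothetical variant and the address strings are made up in the general shape of geocoder output, so check both patterns against your own results before relying on either.

```python
import re

GREEDY = re.compile(r'.*([A-Z]{2}).*')    # the pattern used above
BOUNDED = re.compile(r'\b([A-Z]{2})\b')   # hypothetical variant: a standalone pair

# Made-up address strings for illustration.
addresses = ['Nashville, TN, United States', 'New York, NY, USA']
for address in addresses:
    greedy = GREEDY.search(address).group(1)
    bounded = BOUNDED.search(address).group(1)
    print('%s -> greedy=%s bounded=%s' % (address, greedy, bounded))
```

For the second string the greedy pattern yields `SA` (the tail of `USA`), while the bounded one yields `NY`.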

**Here's how to power a Cartogram visualization with the data from the "results" variable**

In [None]:
import os
import json
from IPython.display import IFrame
from IPython.core.display import display

# Load in a data structure mapping state names to codes.
# e.g. West Virginia is WV
codes = json.loads(open('resources/ch03-linkedin/viz/states-codes.json').read())

from collections import Counter
c = Counter([r[1] for r in results.items()])
states_freqs = { codes[k] : v for (k,v) in c.items() }

# Lace in all of the other states and provide a minimum value for each of them
states_freqs.update({v : 0.5 for v in codes.values() if v not in states_freqs.keys() })

# Write output to file
f = open('resources/ch03-linkedin/viz/states-freqs.json', 'w')
f.write(json.dumps(states_freqs, indent=1))
f.close()

# IPython Notebook can serve files and display them into
# inline frames. Prepend the path with the 'files' prefix

display(IFrame('files/resources/ch03-linkedin/viz/cartogram.html', '100%', '600px'))

## Example 11. Using NLTK to compute bigrams

In [None]:
import nltk
ceo_bigrams = nltk.bigrams("Chief Executive Officer".split(), pad_right=True, 
                                                              pad_left=True)
cto_bigrams = nltk.bigrams("Chief Technology Officer".split(), pad_right=True, 
                                                               pad_left=True)

print ceo_bigrams
print cto_bigrams
print len(set(ceo_bigrams).intersection(set(cto_bigrams)))
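
The size of an intersection is the raw ingredient of the Jaccard distance that the clustering examples use. The following is a small pure-Python sketch of the same quantity computed on word tokens, written here as a local function so that it runs without NLTK.

```python
def jaccard_distance(a, b):
    """Return 1 minus (size of intersection / size of union) for two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / float(len(union))

ceo = 'Chief Executive Officer'.split()
cto = 'Chief Technology Officer'.split()
distance = jaccard_distance(ceo, cto)
print(distance)
```

The two titles share two of four distinct tokens, so the distance is 0.5. With a threshold of 0.5 and a strict `<` comparison, as in Example 12, this pair would fall just outside a common cluster.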

## Example 12. Clustering job titles using a greedy heuristic

In [None]:
import os
import csv
from nltk.metrics.distance import jaccard_distance

# XXX: Place your "Outlook CSV" formatted file of connections from 
# http://www.linkedin.com/people/export-settings at the following
# location: resources/ch03-linkedin/my_connections.csv

CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv')

# Tweak this distance threshold and try different distance calculations 
# during experimentation
DISTANCE_THRESHOLD = 0.5
DISTANCE = jaccard_distance

def cluster_contacts_by_title(csv_file):

    transforms = [
        ('Sr.', 'Senior'),
        ('Sr', 'Senior'),
        ('Jr.', 'Junior'),
        ('Jr', 'Junior'),
        ('CEO', 'Chief Executive Officer'),
        ('COO', 'Chief Operating Officer'),
        ('CTO', 'Chief Technology Officer'),
        ('CFO', 'Chief Finance Officer'),
        ('VP', 'Vice President'),
        ]

    separators = ['/', ' and ', '&']

    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='"')
    contacts = [row for row in csvReader]

    # Normalize and/or replace known abbreviations
    # and build up a list of common titles.

    all_titles = []
    for i, _ in enumerate(contacts):
        if contacts[i]['Job Title'] == '':
            contacts[i]['Job Titles'] = ['']
            continue
        titles = [contacts[i]['Job Title'].strip()]
        for title in titles:
            for separator in separators:
                if title.find(separator) >= 0:
                    titles.remove(title.strip())
                    titles.extend([title.strip() for title in title.split(separator)
                                  if title.strip() != ''])

        for transform in transforms:
            titles = [title.replace(*transform) for title in titles]
        contacts[i]['Job Titles'] = titles
        all_titles.extend(titles)

    all_titles = list(set(all_titles))

    clusters = {}
    for title1 in all_titles:
        clusters[title1] = []
        for title2 in all_titles:
            if title2 in clusters[title1] or clusters.has_key(title2) and title1 \
                in clusters[title2]:
                continue
            distance = DISTANCE(set(title1.split()), set(title2.split()))

            if distance < DISTANCE_THRESHOLD:
                clusters[title1].append(title2)

    # Flatten out clusters

    clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]

    # Round up contacts who are in these clusters and group them together

    clustered_contacts = {}
    for cluster in clusters:
        clustered_contacts[tuple(cluster)] = []
        for contact in contacts:
            for title in contact['Job Titles']:
                if title in cluster:
                    clustered_contacts[tuple(cluster)].append('%s %s'
                            % (contact['First Name'], contact['Last Name']))

    return clustered_contacts


clustered_contacts = cluster_contacts_by_title(CSV_FILE)
print clustered_contacts
for titles in clustered_contacts:
    common_titles_heading = 'Common Titles: ' + ', '.join(titles)

    descriptive_terms = set(titles[0].split())
    for title in titles:
        descriptive_terms.intersection_update(set(title.split()))
    descriptive_terms_heading = 'Descriptive Terms: ' \
        + ', '.join(descriptive_terms)
    print descriptive_terms_heading
    print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))
    print '\n'.join(clustered_contacts[titles])
    print
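
To see the greedy heuristic at work without a CSV export, here is a simplified sketch on a handful of made-up titles. It omits the symmetric-pair skip used above and deduplicates the groups at the end instead, so treat it as an illustration of the idea rather than a drop-in replacement; `greedy_clusters` is a name introduced for this example.

```python
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return (len(a | b) - len(a & b)) / float(len(a | b))

def greedy_clusters(titles, threshold=0.5):
    """Group each title with every title (itself included) closer than `threshold`."""
    clusters = {}
    for t1 in titles:
        clusters[t1] = [t2 for t2 in titles
                        if jaccard_distance(t1.split(), t2.split()) < threshold]
    # Keep non-singleton groups and drop exact duplicates.
    unique = set(tuple(sorted(c)) for c in clusters.values() if len(c) > 1)
    return sorted(unique)

# Made-up titles for illustration.
titles = ['Senior Software Engineer', 'Software Engineer', 'Staff Software Engineer',
          'Chief Executive Officer', 'Chief Technology Officer', 'Data Scientist']
result = greedy_clusters(titles)
for cluster in result:
    print(cluster)
```

The groups overlap: "Software Engineer" is close to both of its neighbours, while the senior and staff titles are exactly 0.5 apart and therefore do not pull each other in directly.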

**Incorporating random sampling can improve performance of the nested loops in Example 12**

In [None]:
import os
import csv
import random
from nltk.metrics.distance import jaccard_distance

# XXX: Place your "Outlook CSV" formatted file of connections from 
# http://www.linkedin.com/people/export-settings at the following
# location: resources/ch03-linkedin/my_connections.csv

CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv')

# Tweak this distance threshold and try different distance calculations 
# during experimentation
DISTANCE_THRESHOLD = 0.5
DISTANCE = jaccard_distance

# Adjust sample size as needed to reduce the runtime of the
# nested loop that invokes the DISTANCE function
SAMPLE_SIZE = 500

def cluster_contacts_by_title(csv_file):

    transforms = [
        ('Sr.', 'Senior'),
        ('Sr', 'Senior'),
        ('Jr.', 'Junior'),
        ('Jr', 'Junior'),
        ('CEO', 'Chief Executive Officer'),
        ('COO', 'Chief Operating Officer'),
        ('CTO', 'Chief Technology Officer'),
        ('CFO', 'Chief Finance Officer'),
        ('VP', 'Vice President'),
        ]

    separators = ['/', ' and ', '&']

    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='"')
    contacts = [row for row in csvReader]

    # Normalize and/or replace known abbreviations
    # and build up list of common titles

    all_titles = []
    for i, _ in enumerate(contacts):
        if contacts[i]['Job Title'] == '':
            contacts[i]['Job Titles'] = ['']
            continue
        titles = [contacts[i]['Job Title'].strip()]
        for title in titles:
            for separator in separators:
                if title.find(separator) >= 0:
                    titles.remove(title.strip())
                    titles.extend([title.strip() for title in title.split(separator)
                                  if title.strip() != ''])

        for transform in transforms:
            titles = [title.replace(*transform) for title in titles]
        contacts[i]['Job Titles'] = titles
        all_titles.extend(titles)

    all_titles = list(set(all_titles))
    clusters = {}
    for title1 in all_titles:
        clusters[title1] = []
        for sample in xrange(SAMPLE_SIZE):
            title2 = all_titles[random.randint(0, len(all_titles)-1)]
            if title2 in clusters[title1] or clusters.has_key(title2) and title1 \
                in clusters[title2]:
                continue
            distance = DISTANCE(set(title1.split()), set(title2.split()))
            if distance < DISTANCE_THRESHOLD:
                clusters[title1].append(title2)

    # Flatten out clusters

    clusters = [clusters[title] for title in clusters if len(clusters[title]) > 1]

    # Round up contacts who are in these clusters and group them together

    clustered_contacts = {}
    for cluster in clusters:
        clustered_contacts[tuple(cluster)] = []
        for contact in contacts:
            for title in contact['Job Titles']:
                if title in cluster:
                    clustered_contacts[tuple(cluster)].append('%s %s'
                            % (contact['First Name'], contact['Last Name']))

    return clustered_contacts


clustered_contacts = cluster_contacts_by_title(CSV_FILE)
print clustered_contacts
for titles in clustered_contacts:
    common_titles_heading = 'Common Titles: ' + ', '.join(titles)

    descriptive_terms = set(titles[0].split())
    for title in titles:
        descriptive_terms.intersection_update(set(title.split()))
    descriptive_terms_heading = 'Descriptive Terms: ' \
        + ', '.join(descriptive_terms)
    print descriptive_terms_heading
    print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))
    print '\n'.join(clustered_contacts[titles])
    print

**How to export data (contained in the "clustered contacts" variable) to power faceted display as outlined in Figure 3.**

In [None]:
import json
import os
from IPython.display import IFrame
from IPython.core.display import display

data = {"label" : "name", "temp_items" : {}, "items" : []} 
for titles in clustered_contacts:
    descriptive_terms = set(titles[0].split())
    for title in titles:
        descriptive_terms.intersection_update(set(title.split()))
    descriptive_terms = ', '.join(descriptive_terms)

    if data['temp_items'].has_key(descriptive_terms):
        data['temp_items'][descriptive_terms].extend([{'name' : cc } for cc 
            in clustered_contacts[titles]])
    else:
        data['temp_items'][descriptive_terms] = [{'name' : cc } for cc 
            in clustered_contacts[titles]]

for descriptive_terms in data['temp_items']:
    data['items'].append({"name" : "%s (%s)" % (descriptive_terms, 
        len(data['temp_items'][descriptive_terms]),),
                              "children" : [i for i in 
                              data['temp_items'][descriptive_terms]]})

del data['temp_items']

# Open the template and substitute the data

TEMPLATE = 'resources/ch03-linkedin/viz/dojo_tree.html.template'                                                
OUT = 'resources/ch03-linkedin/viz/dojo_tree.html'

viz_file = 'files/resources/ch03-linkedin/viz/dojo_tree.html'

t = open(TEMPLATE).read()
f = open(OUT, 'w')
f.write(t % json.dumps(data, indent=4))
f.close()

# IPython Notebook can serve files and display them into
# inline frames. Prepend the path with the 'files' prefix

display(IFrame(viz_file, '400px', '600px'))

**How to export data to power a dendrogram and node-link tree visualization as outlined in Figure 4.**

In [None]:
import os
import csv
import random
from nltk.metrics.distance import jaccard_distance
from cluster import HierarchicalClustering

# XXX: Place your "Outlook CSV" formatted file of connections from 
# http://www.linkedin.com/people/export-settings at the following
# location: resources/ch03-linkedin/my_connections.csv

CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv')

OUT_FILE = 'resources/ch03-linkedin/viz/d3-data.json'

# Tweak this distance threshold and try different distance calculations 
# during experimentation
DISTANCE_THRESHOLD = 0.5
DISTANCE = jaccard_distance

# Adjust sample size as needed to reduce the runtime of the
# nested loop that invokes the DISTANCE function
SAMPLE_SIZE = 500

def cluster_contacts_by_title(csv_file):

    transforms = [
        ('Sr.', 'Senior'),
        ('Sr', 'Senior'),
        ('Jr.', 'Junior'),
        ('Jr', 'Junior'),
        ('CEO', 'Chief Executive Officer'),
        ('COO', 'Chief Operating Officer'),
        ('CTO', 'Chief Technology Officer'),
        ('CFO', 'Chief Finance Officer'),
        ('VP', 'Vice President'),
        ]

    # ' and ' is padded with spaces so that titles such as "Brand Manager" are not split
    separators = ['/', ' and ', '&']

    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='"')
    contacts = [row for row in csvReader]

    # Normalize and/or replace known abbreviations
    # and build up list of common titles

    all_titles = []
    for i, _ in enumerate(contacts):
        if contacts[i]['Job Title'] == '':
            contacts[i]['Job Titles'] = ['']
            continue
        titles = [contacts[i]['Job Title']]
        for title in titles:
            for separator in separators:
                if title.find(separator) >= 0:
                    titles.remove(title)
                    titles.extend([title.strip() for title in title.split(separator)
                                  if title.strip() != ''])

        for transform in transforms:
            titles = [title.replace(*transform) for title in titles]
        contacts[i]['Job Titles'] = titles
        all_titles.extend(titles)

    all_titles = list(set(all_titles))
    
    # Define a scoring function
    def score(title1, title2): 
        return DISTANCE(set(title1.split()), set(title2.split()))

    # Feed the class your data and the scoring function
    hc = HierarchicalClustering(all_titles, score)

    # Cluster the data according to a distance threshold
    clusters = hc.getlevel(DISTANCE_THRESHOLD)

    # Remove singleton clusters
    clusters = [c for c in clusters if len(c) > 1]

    # Round up contacts who are in these clusters and group them together

    clustered_contacts = {}
    for cluster in clusters:
        clustered_contacts[tuple(cluster)] = []
        for contact in contacts:
            for title in contact['Job Titles']:
                if title in cluster:
                    clustered_contacts[tuple(cluster)].append('%s %s'
                            % (contact['First Name'], contact['Last Name']))

    return clustered_contacts

def display_output(clustered_contacts):
    
    for titles in clustered_contacts:
        common_titles_heading = 'Common Titles: ' + ', '.join(titles)

        descriptive_terms = set(titles[0].split())
        for title in titles:
            descriptive_terms.intersection_update(set(title.split()))
        descriptive_terms_heading = 'Descriptive Terms: ' \
            + ', '.join(descriptive_terms)
        print descriptive_terms_heading
        print '-' * max(len(descriptive_terms_heading), len(common_titles_heading))
        print '\n'.join(clustered_contacts[titles])
        print

def write_d3_json_output(clustered_contacts):
    
    json_output = {'name' : 'My LinkedIn', 'children' : []}

    for titles in clustered_contacts:

        descriptive_terms = set(titles[0].split())
        for title in titles:
            descriptive_terms.intersection_update(set(title.split()))

        json_output['children'].append({'name' : ', '.join(descriptive_terms)[:30], 
                                    'children' : [ {'name' : c.decode('utf-8', 'replace')} for c in clustered_contacts[titles] ] } )
    
    # Write the file once, after all clusters have been collected
    f = open(OUT_FILE, 'w')
    f.write(json.dumps(json_output, indent=1))
    f.close()
    
clustered_contacts = cluster_contacts_by_title(CSV_FILE)
display_output(clustered_contacts)
write_d3_json_output(clustered_contacts)
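
The `cluster` package performs the merging internally. To make the general agglomerative idea concrete, here is a rough pure-Python sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is a choice made for this illustration and not necessarily what `HierarchicalClustering` does; `single_linkage` and the sample titles are likewise made up.

```python
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return (len(a | b) - len(a & b)) / float(len(a | b))

def single_linkage(items, distance, threshold):
    """Merge the two closest clusters repeatedly until none are within `threshold`."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None  # (distance, i, j) for the closest pair of clusters so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

def score(title1, title2):
    return jaccard_distance(title1.split(), title2.split())

# Made-up titles for illustration.
titles = ['Senior Software Engineer', 'Software Engineer', 'Staff Software Engineer',
          'Chief Executive Officer', 'Data Scientist']
result = single_linkage(titles, score, 0.4)
print(result)
```

Unlike a pairwise comparison, merging chains items together: the senior and staff titles are 0.5 apart, yet both end up in one cluster because each is within 0.4 of "Software Engineer".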

**Once you've run the code and produced the output for the dendrogram and node-link tree visualizations, here's one way to serve it.**

In [None]:
import os
from IPython.display import IFrame
from IPython.core.display import display

# IPython Notebook can serve files and display them into
# inline frames. Prepend the path with the 'files' prefix

viz_file = 'files/resources/ch03-linkedin/viz/node_link_tree.html'

# XXX: Another visualization you could try:
#viz_file = 'files/resources/ch03-linkedin/viz/dendogram.html'

display(IFrame(viz_file, '100%', '600px'))

## Example 13. Clustering your LinkedIn professional network based upon the locations of your connections and emitting KML output for visualization with Google Earth

In [None]:
import os
import sys
import json
from urllib2 import HTTPError
from geopy import geocoders
from cluster import KMeansClustering, centroid

# A helper function to munge data and build up an XML tree.
# It references some code tucked away in another directory, so we have to
# add that directory to the PYTHONPATH for it to be picked up.
sys.path.append(os.path.join(os.getcwd(), "resources", "ch03-linkedin"))
from linkedin__kml_utility import createKML

# XXX: Try different values for K to see the difference in clusters that emerge

K = 3

# XXX: Get an API key and pass it in here. See https://www.bingmapsportal.com.
GEO_API_KEY = ''
g = geocoders.Bing(GEO_API_KEY)

# Load this data from where you've previously stored it

CONNECTIONS_DATA = 'resources/ch03-linkedin/linkedin_connections.json'

OUT_FILE = "resources/ch03-linkedin/viz/linkedin_clusters_kmeans.kml"

# Open up your saved connections with extended profile information
# or fetch them again from LinkedIn if you prefer

connections = json.loads(open(CONNECTIONS_DATA).read())['values']

locations = [c['location']['name'] for c in connections if c.has_key('location')]

# Some basic transforms may be necessary for geocoding services to function properly
# Here are a couple that seem to help.

transforms = [('Greater ', ''), (' Area', '')]

# Step 1 - Tally the frequency of each location

coords_freqs = {}
for location in locations:

    # Contacts without a location were already filtered out when building `locations`
    
    # Avoid unnecessary I/O and geo requests by building up a cache

    if coords_freqs.has_key(location):
        coords_freqs[location][1] += 1
        continue
    transformed_location = location

    for transform in transforms:
        transformed_location = transformed_location.replace(*transform)

    # Handle potential I/O errors with a retry pattern: give up after 3 failures

    num_errors = 0
    while True:
        try:
            results = g.geocode(transformed_location, exactly_one=False)
            break
        except HTTPError, e:
            num_errors += 1
            if num_errors >= 3:
                sys.exit()
            print >> sys.stderr, e
            print >> sys.stderr, 'Encountered an urllib2 error. Trying again...'

    # geocode may return None when nothing matches, so guard before iterating
    for result in results or []:
        # Each result is of the form ("Description", (X,Y))
        coords_freqs[location] = [result[1], 1]
        break # Disambiguation strategy is "pick first"

# Step 2 - Build up data structure for converting locations to KML            
            
# Here, you could optionally segment locations by continent or country
# so as to avoid potentially finding a mean in the middle of the ocean.
# The k-means algorithm will expect distinct points for each contact, so
# build out an expanded list to pass it.

expanded_coords = []
for label in coords_freqs:
    # Flip lat/lon for Google Earth
    ((lat, lon), f) = coords_freqs[label]
    expanded_coords.append((label, [(lon, lat)] * f))

# No need to clutter the map with unnecessary placemarks...

kml_items = [{'label': label, 'coords': '%s,%s' % coords[0]} for (label,
             coords) in expanded_coords]

# It would also be helpful to include names of your contacts on the map

for item in kml_items:
    item['contacts'] = '\n'.join(['%s %s.' % (c['firstName'], c['lastName'])
        for c in connections if c.has_key('location') and 
                                c['location']['name'] == item['label']])

# Step 3 - Cluster locations and extend the KML data structure with centroids
    
cl = KMeansClustering([coords for (label, coords_list) in expanded_coords
                      for coords in coords_list])

centroids = [{'label': 'CENTROID', 'coords': '%s,%s' % centroid(c)} for c in
             cl.getclusters(K)]

kml_items.extend(centroids)

# Step 4 - Create the final KML output and write it to a file

kml = createKML(kml_items)

f = open(OUT_FILE, 'w')
f.write(kml)
f.close()

print 'Data written to ' + OUT_FILE
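
Step 2 repeats each coordinate once per contact before clustering, so that a location with many contacts pulls a cluster center toward itself. The sketch below illustrates why, under the assumption that a centroid is the plain arithmetic mean of its points. The helper `mean_point` and the sample coordinates are hypothetical, and the `cluster` package is not used here.

```python
def mean_point(points):
    """Arithmetic mean of a list of (lon, lat) tuples."""
    n = float(len(points))
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n)

# Hypothetical data in the same shape as coords_freqs: [(lat, lon), frequency]
sample_coords_freqs = {
    'City A': [(10.0, 20.0), 3],
    'City B': [(30.0, 40.0), 1],
}

expanded = []
for label, ((lat, lon), freq) in sample_coords_freqs.items():
    # Flip to (lon, lat) and repeat the point once per contact
    expanded.extend([(lon, lat)] * freq)

print(mean_point(expanded))  # (25.0, 15.0)
```

Without the expansion, the mean of the two distinct points would be (30.0, 20.0); with it, the result sits closer to City A, where three of the four contacts are.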

# 3.4 Summary

# References

* [1] Social Web Mining (소셜 웹 마이닝): Social Media Data Mining and Analysis, 2nd Edition - http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&mallGb=KOR&barcode=9788994774893
* [2] Author's source code - http://bit.ly/16kGNyb
* [3] LinkedIn developer platform - https://www.linkedin.com/secure/developer
* [4] Official LinkedIn OAuth 2 documentation - https://developer.linkedin.com/docs/oauth2
* [5] LinkedIn API + OAuth 2 slides - http://www.slideshare.net/KamyarMohager/o-auth-2-and-linked-inpdf