# Collecting artist data

## Activity
30 March 2020:
- collected the list of artist names
- used python to create the list of urls
- learned some regex and lambda
- requested the data and used except to skip the urls that didn't have the bio content
- created a data frame using pandas that contains the artist names + their bios from artnet
- pickled the files

## Introduction

This notebook is based on https://github.com/adashofdata/nlp-in-python-tutorial; thank you, Alice Zhao.

Steps:
1. **Getting the data** - in this case, we'll be scraping data from an art website, artnet.com
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to input into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## Problem Statement

I'd like to find out how current artists describe themselves, and whether that has an impact on their popularity.

## Getting The Data

I will use the artnet.com database to import the artist statements, or brief descriptions of their style and work. I presume most of these have been drafted over time by the galleries that represent the artists. Since I am only interested in living artists, I will look for top lists of living painters today. According to [mentalfloss](https://www.mentalfloss.com/article/80830/100-most-collectible-living-artists), these are the:
Top 100 Living Artists (2011-2016)

Gerhard Richter (b. 1932)
Jeff Koons (b. 1955)
Christopher Wool (b. 1955)
Cul Ruzhuo (b. 1944)
Zeng Fanzhi (b. 1964)
Yayoi Kusama (b. 1929)
Richard Prince (b. 1949)
Peter Doig (b. 1959)
Fan Zeng (b. 1938)
Ed Ruscha (b. 1937)
Damien Hirst (b. 1965)
Zhou Chunya (b. 1955)
Zhang Xiaogang (b. 1958)
Robert Ryman (b. 1930)
Wayne Thiebaud (b. 1920)
He Jiaying (b. 1957)
Huang Yongyu (b. 1924)
Fernando Bolero (b. 1932)
Frank Stella (b. 1936)
Rudolf Stingel (b. 1956)
Jasper Johns (b. 1930)
Yoshitomo Nara (b. 1959)
Liu Wei (b. 1965)
Ju Ming (b. 1938)
Pierre Soulages (b. 1919)
Georg Baselitz (b. 1938)
Lee Wan (b. 1936)
David Hockney (b. 1937)
Anselm Kiefer (b. 1945)
Enrico Castellani (b. 1930)
Cindy Sherman (b. 1954)
Anish Kapoor (b. 1954)
Mark Grotjahn (b. 1968)
Takashi Murakaml (b. 1962)
Luo Zhongli (b. 1949)
Robert Indiana (b. 1928)
Andreas Gursky (b. 1955)
Michelangelo Pistoletto (b. 1933)
Wade Guyton (b. 1972)
Wang Yidong (b. 1955)
Chen Peiqiu (b. 1923)
Brice Marden (b. 1938)
Jin Shangyi (b. 1934)
Fang Lijun (b. 1963)
Frank Auerbach (b. 1931)
Yang Feiyun (b. 1954)
Wang Mingming (b. 1952)
Shi Guoliang (b. 1956)
Liu DaWei (b. 1945)
Mark Tansey (b. 1949)
Gunther Uecker (b. 1930)
Jia Youfu (b. 1942)
Mark Bradford (b. 1961)
Ai Xuan (b. 1947)
Liu Ye (b. 1964)
Maurizio Cahalan (b. 1960)
Syed Haider Raza (b. 1922)
George Condo (b. 1957)
John Currin (b. 1962)
Glenn Brown (b. 1966)
Liu Xiaodong (b. 1963)
Urs Fischer (b. 1973)
Miquel Barcelo (b. 1957)
Thomas Schutte (b. 1954)
Liu Kuo Sung (b. 1932)
Glenn Ligon (b. 1960)
Ai Weiwei (b. 1957)
Sean Scully (b. 1945)
Fang Chuxiong (b. 1950)
Bruce Nauman (b. 1941)
Wang Ziwu (b. 1936)
Yue Minjun (b. 1962)
Albert Oehlen (b. 1954)
Antony Gormley (b. 1950)
Xue Mang (b. 1956)
Shang Yang (b. 1942)
David Hammons (b. 1943)
Chung Sang-Hwa (b. 1932)
Zhou Yansheng (b. 1942)
Liu Wenxi (b. 1933)
Julie Mehretu (b. 1970)
Fan Yang (b. 1955)
Lin Yong (b. 1942)
Richard Serra (b. 1939)
Marc Quinn (b. 1964)
Banksy (b. 1974)
Tian Liming (b. 1955)
Xu Lele (b. 1955)
Wang Hualqing (b. 1944)
Sterling Ruby (b. 1972)
Yang Zhiguang (b. 1930)
Wang Guangyi (b. 1957)
Neo Rauch (b. 1960)
Gilbert & George
Robert Longo (b. 1953)
Marlene Dumas (b. 1953)
Ren Zhong (b. 1976)
Bridget Riley (b. 1931)
Howard Terpning (b. 1927)
Tauba Auerbach (b. 1981)


Let's start with the first 10; skip Cul Ruzhuo who doesn't exist on artnet.

In [121]:
import re
import string

names = "Gerhard Richter (b. 1932) Jeff Koons (b. 1955) Christopher Wool (b. 1955) Cul Ruzhuo (b. 1944) Zeng Fanzhi (b. 1964) Yayoi Kusama (b. 1929) Richard Prince (b. 1949) Peter Doig (b. 1959) Fan Zeng (b. 1938) Ed Ruscha (b. 1937) Damien Hirst (b. 1965) Zhou Chunya (b. 1955) Zhang Xiaogang (b. 1958) Robert Ryman (b. 1930) Wayne Thiebaud (b. 1920) He Jiaying (b. 1957) Huang Yongyu (b. 1924) Fernando Bolero (b. 1932) Frank Stella (b. 1936) Rudolf Stingel (b. 1956) Jasper Johns (b. 1930) Yoshitomo Nara (b. 1959) Liu Wei (b. 1965) Ju Ming (b. 1938) Pierre Soulages (b. 1919) Georg Baselitz (b. 1938) Lee Wan (b. 1936) David Hockney (b. 1937) Anselm Kiefer (b. 1945) Enrico Castellani (b. 1930) Cindy Sherman (b. 1954) Anish Kapoor (b. 1954) Mark Grotjahn (b. 1968) Takashi Murakaml (b. 1962) Luo Zhongli (b. 1949) Robert Indiana (b. 1928) Andreas Gursky (b. 1955) Michelangelo Pistoletto (b. 1933) Wade Guyton (b. 1972) Wang Yidong (b. 1955) Chen Peiqiu (b. 1923) Brice Marden (b. 1938) Jin Shangyi (b. 1934) Fang Lijun (b. 1963) Frank Auerbach (b. 1931) Yang Feiyun (b. 1954) Wang Mingming (b. 1952) Shi Guoliang (b. 1956) Liu DaWei (b. 1945) Mark Tansey (b. 1949) Gunther Uecker (b. 1930) Jia Youfu (b. 1942) Mark Bradford (b. 1961) Ai Xuan (b. 1947) Liu Ye (b. 1964) Maurizio Cahalan (b. 1960) Syed Haider Raza (b. 1922) George Condo (b. 1957) John Currin (b. 1962) Glenn Brown (b. 1966) Liu Xiaodong (b. 1963) Urs Fischer (b. 1973) Miquel Barcelo (b. 1957) Thomas Schutte (b. 1954) Liu Kuo Sung (b. 1932) Glenn Ligon (b. 1960) Ai Weiwei (b. 1957) Sean Scully (b. 1945) Fang Chuxiong (b. 1950) Bruce Nauman (b. 1941) Wang Ziwu (b. 1936) Yue Minjun (b. 1962) Albert Oehlen (b. 1954) Antony Gormley (b. 1950) Xue Mang (b. 1956) Shang Yang (b. 1942) David Hammons (b. 1943) Chung Sang-Hwa (b. 1932) Zhou Yansheng (b. 1942) Liu Wenxi (b. 1933) Julie Mehretu (b. 1970) Fan Yang (b. 1955) Lin Yong (b. 1942) Richard Serra (b. 1939) Marc Quinn (b. 1964) Banksy (b. 1974) Tian Liming (b. 
1955) Xu Lele (b. 1955) Wang Hualqing (b. 1944) Sterling Ruby (b. 1972) Yang Zhiguang (b. 1930) Wang Guangyi (b. 1957) Neo Rauch (b. 1960) Gilbert & George Robert Longo (b. 1953) Marlene Dumas (b. 1953) Ren Zhong (b. 1976) Bridget Riley (b. 1931) Howard Terpning (b. 1927) Tauba Auerbach (b. 1981) "
names = names.lower()
names = names.replace(' ','-')
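# Split on each "-(b.-YYYY)-" marker; the capturing group also returns the "b.-YYYY" pieces
names = re.split(r'[-]+\(([^)]*)\)+[-]', names)
# Drop the captured "b.-YYYY" pieces, keeping only the names
names = list(filter(lambda x: not x.startswith('b.'), names))

# Convert the list of names into a list of urls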
artnet = 'http://www.artnet.com/artists/'
urls = [artnet + h for h in names]

#print(names)
print (urls)


['gerhard-richter', 'jeff-koons', 'christopher-wool', 'cul-ruzhuo', 'zeng-fanzhi', 'yayoi-kusama', 'richard-prince', 'peter-doig', 'fan-zeng', 'ed-ruscha', 'damien-hirst', 'zhou-chunya', 'zhang-xiaogang', 'robert-ryman', 'wayne-thiebaud', 'he-jiaying', 'huang-yongyu', 'fernando-bolero', 'frank-stella', 'rudolf-stingel', 'jasper-johns', 'yoshitomo-nara', 'liu-wei', 'ju-ming', 'pierre-soulages', 'georg-baselitz', 'lee-wan', 'david-hockney', 'anselm-kiefer', 'enrico-castellani', 'cindy-sherman', 'anish-kapoor', 'mark-grotjahn', 'takashi-murakaml', 'luo-zhongli', 'robert-indiana', 'andreas-gursky', 'michelangelo-pistoletto', 'wade-guyton', 'wang-yidong', 'chen-peiqiu', 'brice-marden', 'jin-shangyi', 'fang-lijun', 'frank-auerbach', 'yang-feiyun', 'wang-mingming', 'shi-guoliang', 'liu-dawei', 'mark-tansey', 'gunther-uecker', 'jia-youfu', 'mark-bradford', 'ai-xuan', 'liu-ye', 'maurizio-cahalan', 'syed-haider-raza', 'george-condo', 'john-currin', 'glenn-brown', 'liu-xiaodong', 'urs-fischer', '

In [128]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes artist bio text from artnet.com
def url_to_bio(url):
    '''Returns the bio text, as a list of strings, from an artnet.com artist page.'''
    try:
        page = requests.get(url).text
        soup = BeautifulSoup(page, "lxml")
        # soup.find returns None when the page has no "bio" element,
        # so the chained .find_all call raises AttributeError
        text = [div.text for div in soup.find(class_="bio").find_all('div')]
    except AttributeError:
        # Placeholder is a plain string, not a list: a later ' '.join() spaces out its characters
        text = "Not on artnet"

    print(url)
    return text


# URLs of bios in scope;
#urls = ['http://www.artnet.com/artists/gerhard-richter/',
#        'http://www.artnet.com/artists/jeff-koons/',
#        'http://www.artnet.com/artists/christopher-wool/',
#        'http://www.artnet.com/artists/zeng-fanzhi/',
#        'http://www.artnet.com/artists/yayoi-kusama/',
#        'http://www.artnet.com/artists/richard-prince/',
#        'http://www.artnet.com/artists/peter-doig/',
#        'http://www.artnet.com/artists/fan-zeng/',
#        'http://www.artnet.com/artists/ed-ruscha/',
#        'http://www.artnet.com/artists/damien-hirst/',
#        'http://www.artnet.com/artists/zhou-chunya/',
#        'http://www.artnet.com/artists/zhang-xiaogang/']

# artist names, back to normal
artists = [a.replace('-', ' ') for a in names]
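
The scraper above only catches `AttributeError`, so a missing page and a network failure are not told apart. The sketch below shows one way to separate the fetch from the parse. It is an illustration rather than the notebook's method: it swaps `requests` for the standard library's `urllib`, and the function name `fetch_html` and the `opener` parameter are assumptions made here so the function can be exercised without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen, timeout=30):
    """Return the page body as text, or None if the request fails.

    `opener` is a hypothetical injection point (defaults to urlopen) so the
    function can be tried with a fake opener instead of a live site.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            # Decode leniently; undecodable bytes are replaced, not fatal
            return resp.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError):
        # HTTPError covers 4xx/5xx responses, URLError covers connection problems
        return None


# Usage (not run here, it would hit the network):
# html = fetch_html("http://www.artnet.com/artists/gerhard-richter")
# if html is None: the artist can be recorded as missing before any parsing
```

Returning `None` for a failed request keeps "page could not be fetched" distinct from "page has no bio element", which the placeholder string currently merges.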

In [129]:
# Actually request the bios (takes a few minutes to run)
bios = [url_to_bio(u) for u in urls]

http://www.artnet.com/artists/gerhard-richter
http://www.artnet.com/artists/jeff-koons
http://www.artnet.com/artists/christopher-wool
http://www.artnet.com/artists/cul-ruzhuo
http://www.artnet.com/artists/zeng-fanzhi
http://www.artnet.com/artists/yayoi-kusama
http://www.artnet.com/artists/richard-prince
http://www.artnet.com/artists/peter-doig
http://www.artnet.com/artists/fan-zeng
http://www.artnet.com/artists/ed-ruscha
http://www.artnet.com/artists/damien-hirst
http://www.artnet.com/artists/zhou-chunya
http://www.artnet.com/artists/zhang-xiaogang
http://www.artnet.com/artists/robert-ryman
http://www.artnet.com/artists/wayne-thiebaud
http://www.artnet.com/artists/he-jiaying
http://www.artnet.com/artists/huang-yongyu
http://www.artnet.com/artists/fernando-bolero
http://www.artnet.com/artists/frank-stella
http://www.artnet.com/artists/rudolf-stingel
http://www.artnet.com/artists/jasper-johns
http://www.artnet.com/artists/yoshitomo-nara
http://www.artnet.com/artists/liu-wei
http://www.ar

Yes, there are two links that should not be there or should have been split differently, but that is something to fix another time:
- http://www.artnet.com/artists/gilbert-&-george-robert-longo should be two links, one for Gilbert & George and one for Robert Longo (two different artists/groups)
- the last one, http://www.artnet.com/artists/, has no artist name and could be removed later on
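
Both problems come from the split step: Gilbert & George has no `(b. YYYY)` marker to split on, and the trailing marker leaves an empty string at the end. A minimal sketch of one possible fix follows. The names `KNOWN_GROUPS`, `parse_artist_names` and `to_slug` are hypothetical helpers, not part of the notebook, and whether artnet actually serves a page at the resulting slug for the duo is not checked here.

```python
import re

# Hypothetical list of entries that appear without a "(b. YYYY)" marker
KNOWN_GROUPS = ("Gilbert & George",)


def parse_artist_names(raw, groups=KNOWN_GROUPS):
    """Split 'Name (b. YYYY) Name (b. YYYY) ...' into a list of names."""
    # Split on every "(b. YYYY)" marker; \s also tolerates a line break inside it
    parts = re.split(r"\s*\(b\.\s*\d{4}\)\s*", raw)
    names = []
    for part in parts:
        part = part.strip()
        if not part:
            continue  # skip the empty piece left after the last marker
        for group in groups:
            # A marker-less group is glued to the next name: peel it off
            if part.startswith(group) and part != group:
                names.append(group)
                part = part[len(group):].strip()
        names.append(part)
    return names


def to_slug(name):
    """Lowercase a name and join its words with hyphens, as the notebook does."""
    return re.sub(r"\s+", "-", name.lower())


raw = "Neo Rauch (b. 1960) Gilbert & George Robert Longo (b. 1953) Tauba Auerbach (b. 1981) "
print(parse_artist_names(raw))
# ['Neo Rauch', 'Gilbert & George', 'Robert Longo', 'Tauba Auerbach']
```

Parsing the names before lowercasing and hyphenating also keeps the readable names available, so they do not have to be rebuilt from the slugs later.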

In [132]:
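# Pickle files for later use

# Make a new directory to hold the pickled bios
!mkdir artist_bios
# names are already hyphenated, so this replace changes nothing and the hyphenated names become the file names
artists = [a.replace(' ', '_') for a in names]

# Despite the .txt extension, each file holds a pickled Python object
for i, c in enumerate(artists):
    with open("artist_bios/" + c + ".txt", "wb") as file:
        pickle.dump(bios[i], file)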
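
The `!mkdir` line only works inside a notebook and fails if the directory already exists. The sketch below shows a plain-Python alternative for the same save-and-load round trip. The helper names `save_bios` and `load_bios` and the `.pkl` extension are assumptions made for illustration; the notebook itself writes `.txt` files.

```python
import pickle
from pathlib import Path


def save_bios(bios_by_artist, out_dir="artist_bios"):
    """Pickle each bio to <out_dir>/<artist>.pkl, creating the directory if needed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    for artist, bio in bios_by_artist.items():
        with open(out / f"{artist}.pkl", "wb") as fh:
            pickle.dump(bio, fh)


def load_bios(out_dir="artist_bios"):
    """Read every .pkl file in out_dir back into an {artist: bio} dictionary."""
    return {
        path.stem: pickle.loads(path.read_bytes())
        for path in sorted(Path(out_dir).glob("*.pkl"))
    }


# Usage with the notebook's variables (names and bios are parallel lists):
# save_bios(dict(zip(names, bios)))
# data = load_bios()
```

Keying the saved data by artist name, rather than by list position, avoids relying on `names` and `bios` staying in the same order.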

In [133]:
# Load pickled files
data = {}
for i, c in enumerate(artists):
    with open("artist_bios/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [134]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['gerhard-richter', 'jeff-koons', 'christopher-wool', 'cul-ruzhuo', 'zeng-fanzhi', 'yayoi-kusama', 'richard-prince', 'peter-doig', 'fan-zeng', 'ed-ruscha', 'damien-hirst', 'zhou-chunya', 'zhang-xiaogang', 'robert-ryman', 'wayne-thiebaud', 'he-jiaying', 'huang-yongyu', 'fernando-bolero', 'frank-stella', 'rudolf-stingel', 'jasper-johns', 'yoshitomo-nara', 'liu-wei', 'ju-ming', 'pierre-soulages', 'georg-baselitz', 'lee-wan', 'david-hockney', 'anselm-kiefer', 'enrico-castellani', 'cindy-sherman', 'anish-kapoor', 'mark-grotjahn', 'takashi-murakaml', 'luo-zhongli', 'robert-indiana', 'andreas-gursky', 'michelangelo-pistoletto', 'wade-guyton', 'wang-yidong', 'chen-peiqiu', 'brice-marden', 'jin-shangyi', 'fang-lijun', 'frank-auerbach', 'yang-feiyun', 'wang-mingming', 'shi-guoliang', 'liu-dawei', 'mark-tansey', 'gunther-uecker', 'jia-youfu', 'mark-bradford', 'ai-xuan', 'liu-ye', 'maurizio-cahalan', 'syed-haider-raza', 'george-condo', 'john-currin', 'glenn-brown', 'liu-xiaodong', 'urs-f

In [135]:
# More checks
data['gerhard-richter'][:2]

['Gerhard Richter is a contemporary German painter considered among the most influential living artists. Richter’s experiments with abstraction and photo-based painting greatly contributed to the history of the medium. Culling from his vast image archive known as the Atlas, Richter’s paintings reference images of his daughter Betty, flickering candles, aerial photographs, portraits of criminals, and pastoral landscapes. “Pictures are the idea in visual or pictorial form,” he reflected. “And the idea has to be legible, both in the individual picture and in the collective context.” Born on February 9, 1932 in Dresden, Germany during the rise of the Nazi regime. After World War II, living in East Germany under Soviet rule, Richter learned to produce of highly realistic Socialist Realist murals. In 1961, Richter fled to West Germany, where he studied at the Kunstakademie Düsseldorf alongside Sigmar Polke. During this time, the artist first began producing blurred photo-paintings. The works

## Cleaning The Data

My data is pretty clean, so I don't need to remove or add anything to it. I want to note a big difference between my understanding and Alice's (adashofdata) suggestions around cleaning. What she calls cleaning is mostly pre-processing, and it is incorrect to assume that these steps won't affect the analysis. For example, she looks at comedians' routines, where the use of pronouns could be a major signal, especially if you want to understand whether a comedian mostly identifies with the jokes or writes jokes about others.

In my case, I am looking at artists' bios. I don't want to remove the capitals, because they can mark the titles of works and other names, which could be useful to extract. I also don't want to remove numbers, as these may be years.
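
To illustrate why the numbers are worth keeping, here is a minimal sketch that pulls four-digit years out of a bio with a regular expression. The function name `extract_years` and the 1800-2099 range are assumptions chosen for this example, and the sample text is shortened from the Gerhard Richter bio.

```python
import re


def extract_years(text):
    """Return the four-digit years (1800-2099) mentioned in a text, in order."""
    return [int(year) for year in re.findall(r"\b(?:18|19|20)\d{2}\b", text)]


sample = "Born on February 9, 1932 in Dresden, Germany. In 1961, Richter fled to West Germany."
print(extract_years(sample))  # [1932, 1961]
```

If the numbers had been stripped during cleaning, this kind of extraction would no longer be possible.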

Some **pre-processing** steps could be useful, but at this point I don't know which ones. I am first interested in three things:
- try word embeddings
- try to extract gender information
- try to extract other named entities, like artist origin, or places they've potentially been to

In [136]:
# Let's take a look at our data again
next(iter(data.keys()))

'gerhard-richter'

In [137]:
# Notice that our dictionary is currently in key: artist, value: list of text format
next(iter(data.values()))

['Gerhard Richter is a contemporary German painter considered among the most influential living artists. Richter’s experiments with abstraction and photo-based painting greatly contributed to the history of the medium. Culling from his vast image archive known as the Atlas, Richter’s paintings reference images of his daughter Betty, flickering candles, aerial photographs, portraits of criminals, and pastoral landscapes. “Pictures are the idea in visual or pictorial form,” he reflected. “And the idea has to be legible, both in the individual picture and in the collective context.” Born on February 9, 1932 in Dresden, Germany during the rise of the Nazi regime. After World War II, living in East Germany under Soviet rule, Richter learned to produce of highly realistic Socialist Realist murals. In 1961, Richter fled to West Germany, where he studied at the Kunstakademie Düsseldorf alongside Sigmar Polke. During this time, the artist first began producing blurred photo-paintings. The works

In [138]:
# We are going to change this to key: artist, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [139]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [140]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['bios']
data_df = data_df.sort_index()
data_df

|  | bios |
|---|---|
|  | N o t o n a r t n e t |
| ai-weiwei | Ai Weiwei is one of the best known artists working today. His provocative blend of Chinese history and tradition within a wholly contemporary prac... |
| ai-xuan | N o t o n a r t n e t |
| albert-oehlen | N o t o n a r t n e t |
| andreas-gursky | Andreas Gursky is a German artist known for his large-scale digitally manipulated images. Similar in scope to early 19th-century landscape paintin... |
| ... | ... |
| yue-minjun | Yue Minjun is a contemporary Chinese artist known for his inventive take on self-portraiture. Grouped into the Cynical Realism movement in China, ... |
| zeng-fanzhi | Zeng Fanzhi is a contemporary Chinese painter. Known for his Expressionist and psychologically probing portraits, he depicts human faces with vivi... |
| zhang-xiaogang | Zhang Xiaogang is a Chinese painter and preeminent member of the contemporary Chinese avant-garde. His Surrealist-inspired, stylized portraits exe... |
| zhou-chunya | Zhou Chunya is a Chinese painter best known for his Green Dog series of paintings. In this wide-ranging body of work, Zhou depicts a bright green ... |
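
The `N o t o n a r t n e t` rows come from the placeholder in `url_to_bio`: it is a plain string rather than a list, so `' '.join(...)` in `combine_text` puts a space between every character (the displayed table appears to collapse the repeated spaces). A minimal sketch of one way to guard against this, assuming a hypothetical helper `combine_text_safe` that is not part of the notebook:

```python
def combine_text_safe(text):
    """Join a list of text chunks; pass a bare string through unchanged."""
    if isinstance(text, str):
        return text
    return ' '.join(text)


# Joining a string iterates over its characters, spaces included
print(repr(' '.join("Not on artnet")))             # 'N o t   o n   a r t n e t'
print(repr(combine_text_safe("Not on artnet")))    # 'Not on artnet'
print(repr(combine_text_safe(["First.", "Second."])))  # 'First. Second.'
```

The other option is to have the scraper return a one-element list for missing artists, so that every value in `data` has the same type.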


In [141]:
# Apply a first round of text cleaning techniques; re and string are already imported
#import re
#import string

def clean_text_round1(text):
    '''Remove ASCII punctuation only; lowercasing, bracket removal and number removal are left disabled.'''
    #text = text.lower()
    #text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    #text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1

In [143]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.bios.apply(round1))
data_clean

|  | bios |
|---|---|
|  | N o t o n a r t n e t |
| ai-weiwei | Ai Weiwei is one of the best known artists working today His provocative blend of Chinese history and tradition within a wholly contemporary pract... |
| ai-xuan | N o t o n a r t n e t |
| albert-oehlen | N o t o n a r t n e t |
| andreas-gursky | Andreas Gursky is a German artist known for his largescale digitally manipulated images Similar in scope to early 19thcentury landscape paintings ... |
| ... | ... |
| yue-minjun | Yue Minjun is a contemporary Chinese artist known for his inventive take on selfportraiture Grouped into the Cynical Realism movement in China alo... |
| zeng-fanzhi | Zeng Fanzhi is a contemporary Chinese painter Known for his Expressionist and psychologically probing portraits he depicts human faces with vivid ... |
| zhang-xiaogang | Zhang Xiaogang is a Chinese painter and preeminent member of the contemporary Chinese avantgarde His Surrealistinspired stylized portraits execute... |
| zhou-chunya | Zhou Chunya is a Chinese painter best known for his Green Dog series of paintings In this wideranging body of work Zhou depicts a bright green Ger... |
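
The output above shows a side effect of deleting punctuation outright: hyphenated words are fused (`largescale`, `19thcentury`, `avantgarde`). In addition, `string.punctuation` only covers ASCII characters, so curly quotes such as the one in "Richter’s" are left in place. The sketch below is one possible variant, not the notebook's method; the name `clean_text_keep_words` and the choice to turn hyphens into spaces are assumptions.

```python
import re
import string

# Curly single and double quotes, which string.punctuation does not include
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"


def clean_text_keep_words(text):
    """Remove punctuation but keep hyphenated words as separate words."""
    # Turn hyphens into spaces so "large-scale" stays two words
    text = text.replace("-", " ")
    # Remove the remaining ASCII punctuation plus curly quotes
    text = re.sub("[%s]" % re.escape(string.punctuation + CURLY_QUOTES), "", text)
    # Collapse any runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_keep_words("a large-scale, 19th-century \u201cidea\u201d."))
# a large scale 19th century idea
```

Capital letters and numbers are untouched, in line with the decision above to keep titles and years.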


**NOTE:** This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:
* Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...

## Organizing The Data

I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [144]:
# Let's take a look at our dataframe
data_df

|  | bios |
|---|---|
|  | N o t o n a r t n e t |
| ai-weiwei | Ai Weiwei is one of the best known artists working today. His provocative blend of Chinese history and tradition within a wholly contemporary prac... |
| ai-xuan | N o t o n a r t n e t |
| albert-oehlen | N o t o n a r t n e t |
| andreas-gursky | Andreas Gursky is a German artist known for his large-scale digitally manipulated images. Similar in scope to early 19th-century landscape paintin... |
| ... | ... |
| yue-minjun | Yue Minjun is a contemporary Chinese artist known for his inventive take on self-portraiture. Grouped into the Cynical Realism movement in China, ... |
| zeng-fanzhi | Zeng Fanzhi is a contemporary Chinese painter. Known for his Expressionist and psychologically probing portraits, he depicts human faces with vivi... |
| zhang-xiaogang | Zhang Xiaogang is a Chinese painter and preeminent member of the contemporary Chinese avant-garde. His Surrealist-inspired, stylized portraits exe... |
| zhou-chunya | Zhou Chunya is a Chinese painter best known for his Green Dog series of paintings. In this wide-ranging body of work, Zhou depicts a bright green ... |


In [145]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.
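
To make the idea concrete, here is a small pure-Python sketch of what a document-term matrix contains. It is only an illustration of the concept and not how CountVectorizer is implemented: the function name `build_dtm`, the whitespace tokenizer and the tiny stop-word set are all assumptions made for this example.

```python
from collections import Counter

# Tiny illustrative stop-word list, not scikit-learn's English list
STOP_WORDS = {"a", "the", "is", "of", "for"}


def build_dtm(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    counts = {
        name: Counter(w for w in text.lower().split() if w not in STOP_WORDS)
        for name, text in docs.items()
    }
    # The columns are the sorted union of all words seen in any document
    vocab = sorted(set().union(*counts.values()))
    # A Counter returns 0 for words a document does not contain
    matrix = [[counts[name][word] for word in vocab] for name in docs]
    return vocab, matrix


docs = {
    "artist-1": "the painter is a painter",
    "artist-2": "a sculptor of the city",
}
vocab, matrix = build_dtm(docs)
print(vocab)   # ['city', 'painter', 'sculptor']
print(matrix)  # [[0, 2, 0], [1, 0, 1]]
```

Each row is one document and each column one word, which is the same layout the CountVectorizer cell produces for the bios.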

In [None]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
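# The text column in data_clean is called 'bios'
data_cv = cv.fit_transform(data_clean.bios)
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())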
data_dtm.index = data_clean.index
data_dtm

In [None]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [None]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
# Use a with block so the file is closed after writing
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)

## Additional Exercises

1. Can you add an additional regular expression to the clean_text_round1 function to further clean the text?
2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?