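- The crawler is written in Python with Scrapy, using BeautifulSoup for parsing
- Download ElasticSearch from the official ES website; the client library is py-elasticsearch
- Lyrics source: Kuwo (酷我)
- The web front end uses Flask (Python web; in progress)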
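a) Artist list analysis: the paginated request URL can be found in the site's JS: var b = host + "/artist/indexAjax?category=" + index + "&prefix=" + $("#artistContent").attr("data-letter") + "&pn=" + pn; For example: http://www.kuwo.cn/artist/indexAjax?category=0&prefix=&pn=5 When constructing the URL, the parameter pn is the current page number (range pn: 0-6947).
b) Paginated lyrics (song list) analysis: for example, http://www.kuwo.cn//artist/contentMusicsAjax?artistId=2&pn=1&rn=100 where artistId is the artist ID and pn is the page parameter. After that, simply loop over the pages.
c) Lyrics analysis: each song's lyrics are on its detail page, e.g. http://www.kuwo.cn/yinyue/6468891 Note: lyrics sometimes cannot be displayed because the site lacks the copyright, so exception handling is needed here.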
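The three request patterns above can be sketched as small URL builders. This is only an illustration using the standard library, not the project's actual spider code: the function names, the `LyricUnavailable` exception, and the caller-supplied `fetch` callable are all hypothetical. The artist page range comes from the pn: 0-6947 note above; the number of song pages per artist is not stated, so it is left to the caller. The sketch uses a single slash after the host.

```python
from urllib.parse import urlencode

HOST = "http://www.kuwo.cn"


def artist_page_url(pn, category=0, prefix=""):
    """Step a): URL of one page of the artist index."""
    query = urlencode({"category": category, "prefix": prefix, "pn": pn})
    return f"{HOST}/artist/indexAjax?{query}"


def artist_songs_url(artist_id, pn, rn=100):
    """Step b): URL of one page of an artist's song list."""
    query = urlencode({"artistId": artist_id, "pn": pn, "rn": rn})
    return f"{HOST}/artist/contentMusicsAjax?{query}"


def song_detail_url(song_id):
    """Step c): URL of the detail page that carries the lyrics."""
    return f"{HOST}/yinyue/{song_id}"


def iter_artist_page_urls(first=0, last=6947):
    """Yield every artist-index page URL in the inclusive range."""
    for pn in range(first, last + 1):
        yield artist_page_url(pn)


class LyricUnavailable(Exception):
    """Hypothetical error a fetcher raises when a page shows no lyrics."""


def fetch_lyric_or_none(fetch, song_id):
    """Call the supplied fetcher; return None if the lyrics are missing."""
    try:
        return fetch(song_detail_url(song_id))
    except LyricUnavailable:
        return None
```

Returning `None` for songs without displayable lyrics lets the crawl loop skip them instead of aborting, which is one way to satisfy the exception-handling note in step c).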
To start the crawler, enter the project directory: cd kuwolyc
Then run: scrapy crawl lyc
Docker page: https://hub.docker.com/_/elasticsearch/
Chinese-language reference: http://kael-aiur.com/docker/%E5%9C%A8docker%E4%B8%8A%E8%BF%90%E8%A1%8Celasticsearch.html
https://github.com/medcl/elasticsearch-analysis-ik IK Analysis for Elasticsearch
{
  "properties": {
    "album": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "href": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "lyric": {
      "type": "text",
      "analyzer": "ik_smart"
    },
    "name": {
      "type": "text",
      "analyzer": "ik_max_word"
    },
    "singer": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}
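As a rough sketch, the mapping above can be assembled in Python and wrapped into the body of a create-index request (`PUT /<index>`). The helper names are hypothetical, and placing `properties` directly under `mappings` assumes a typeless mapping as used by Elasticsearch 7 and later; older versions expect a type name between the two. The IK plugin must be installed for the `ik_smart` and `ik_max_word` analyzers to resolve.

```python
import json


def text_with_keyword():
    """A text field with a keyword sub-field, as used for album/href/singer."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }


def build_index_body():
    """Build a create-index body from the field mapping shown above."""
    properties = {
        "album": text_with_keyword(),
        "href": text_with_keyword(),
        "lyric": {"type": "text", "analyzer": "ik_smart"},
        "name": {"type": "text", "analyzer": "ik_max_word"},
        "singer": text_with_keyword(),
    }
    # Assumption: typeless mapping layout (Elasticsearch 7+).
    return {"mappings": {"properties": properties}}


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```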
Import script: json2es.py
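The contents of json2es.py are not shown here, so the following is only a guess at the general shape of such an import step, not the script itself. It converts crawled documents into the newline-delimited payload that the Elasticsearch bulk endpoint (`POST /_bulk`) accepts: one action line followed by one source line per document. The index name `lyrics` and the function name are assumptions, and versions before 7 may also need a type in the action line.

```python
import json

INDEX_NAME = "lyrics"  # hypothetical index name, not taken from the project


def to_bulk_payload(docs, index=INDEX_NAME):
    """Render documents as an NDJSON body for the bulk API."""
    lines = []
    for doc in docs:
        # Action line: index this document into the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself; keep Chinese text unescaped.
        lines.append(json.dumps(doc, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"name": "demo song", "singer": "demo singer", "lyric": "la la"}]
    print(to_bulk_payload(sample), end="")
```

Sending documents in batches this way avoids one HTTP round trip per song, which matters when the crawl produces a large number of lyrics.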
# Enter the web directory
cd web
# Set up the Flask environment
pip install -r requirements.txt
# Run the web app
python run.py --port 8085